
A budget of classifier evaluation measures

Beginning analysts and data scientists often ask: “how does one remember and master the seemingly endless number of classifier metrics?”

My concrete advice is:

  • Read Nina Zumel’s excellent series on scoring classifiers.
  • Keep notes.
  • Settle on one or two metrics as you move project to project. We prefer “AUC” early in a project (when you want a flexible score) and “deviance” late in a project (when you want a strict score).
  • When working on practical problems, work with your business partners to find out which of precision/recall or sensitivity/specificity best matches their business needs. If you have time, show them and explain the ROC plot, and invite them to price and pick points along the ROC curve that best fit their business goals. Finance partners will rapidly recognize the ROC curve as “the efficient frontier” of classifier performance and be very comfortable working with this summary.

That being said, there always seems to be a bit of gamesmanship: somebody always brings up yet another score, often apparently in the hope you may not have heard of it. Some choices of measure signal your pedigree (precision/recall implies a data mining background, sensitivity/specificity a medical science background), and some seem chosen in the hope of befuddling others.


Stanley Wyatt illustration from “Mathmanship” Nicholas Vanserg, 1958, collected in A Stress Analysis of a Strapless Evening Gown, Robert A. Baker, Prentice-Hall, 1963

The rest of this note is some help in dealing with this menagerie of common competing classifier evaluation scores.


Let’s define our terms. We are going to work with “binary classification” problems. These are problems where we have example instances (also called rows) that are either “in the class” (we will call these instances “true”) or not (and we will call these instances “false”). A classifier is a function that, given the description of an instance, tries to determine if the instance is in the class or not. The classifier may either return a decision of “positive”/“negative” (indicating the classifier thinks the instance is in or out of the class) or a probability score denoting the estimated probability of being in the class.

Decision or Hard Classifiers

For decision based (or “hard”) classifiers (those returning only a positive/negative determination) the “confusion matrix” is a sufficient statistic in the sense it contains all of the information summarizing classifier quality. All other classification measures can be derived from it.

For a decision classifier (one that returns “positive” and “negative”, and not probabilities) the classifier’s performance is completely determined by four counts:

  • The True Positive count: the number of items in the true class that the classifier declares to be positive.
  • The True Negative count: the number of items in the false class that the classifier declares to be negative.
  • The False Positive count: the number of items not in the true class that the classifier declares to be positive.
  • The False Negative count: the number of items in the true class that the classifier declares to be negative.

Notice true and false are being used to indicate if the classifier is correct (and not the actual category of each item) in these terms. This is traditional nomenclature. The first two quantities are where the classifier is correct (positive corresponding to true and negative corresponding to false) and the second two quantities count instances where the classifier is incorrect.
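As a small illustration of the four counts, here is a minimal Python sketch that tallies them from ground truth and hard decisions. The function name `confusion_counts` and the toy data are hypothetical, chosen only for this example.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) for parallel lists of booleans.

    actual[i]    -- True if instance i really is in the class
    predicted[i] -- True if the classifier declared instance i positive
    """
    tp = fp = fn = tn = 0
    for is_true, says_positive in zip(actual, predicted):
        if says_positive and is_true:
            tp += 1      # correct "positive"
        elif says_positive:
            fp += 1      # "positive" on a false example
        elif is_true:
            fn += 1      # "negative" on a true example
        else:
            tn += 1      # correct "negative"
    return tp, fp, fn, tn

# Hypothetical toy data: ground truth and hard classifier decisions.
actual    = [True, True,  True, False, False, False, False, True]
predicted = [True, False, True, True,  False, False, False, True]
print(confusion_counts(actual, predicted))  # (3, 1, 1, 3)
```

Note that the first word of each count name records whether the classifier was right, and the second word records what the classifier said.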

It is traditional to arrange these quantities into a 2 by 2 table called the confusion matrix. If we define:

A = Var('TruePositives')
B = Var('FalsePositives')
C = Var('FalseNegatives')
D = Var('TrueNegatives')

(Note all code shared here.)

Then the caret R package defines the confusion matrix as follows (see help("confusionMatrix")):

            Reference
Predicted   Event   No Event
     Event  A       B
  No Event  C       D

Reference is “ground truth” or actual outcome. We will call examples that have true ground truth “true examples” (again, please don’t confuse this with “TrueNegatives”, which are “false examples” that are correctly scored as being false). We would prefer to have the classifier’s prediction indicate columns instead of rows, but we will use the caret notation for consistency.

We can encode what we have written about these confusion matrix summaries as algebraic statements. Caret’s help("confusionMatrix") then gives us definitions of a number of common classifier scores:

# (A+C) and (B+D) are facts about the data, independent of classifier.
Sensitivity = A/(A+C)
Specificity = D/(B+D)
Prevalence = (A+C)/(A+B+C+D)
PPV = (Sensitivity * Prevalence)/((Sensitivity*Prevalence) + ((1-Specificity)*(1-Prevalence)))
NPV = (Specificity * (1-Prevalence))/(((1-Sensitivity)*Prevalence) + ((Specificity)*(1-Prevalence)))
DetectionRate = A/(A+B+C+D)
DetectionPrevalence = (A+B)/(A+B+C+D)
BalancedAccuracy = (Sensitivity+Specificity)/2

We can (from our notes) also define some more common metrics:

TPR = A/(A+C)     # True Positive Rate
FPR = B/(B+D)     # False Positive Rate
FNR = C/(A+C)     # False Negative Rate
TNR = D/(B+D)     # True Negative Rate
Recall = A/(A+C)
Precision = A/(A+B)
Accuracy = (A+D)/(A+B+C+D)

By writing everything down it becomes obvious that Sensitivity==TPR==Recall. That won’t stop somebody from complaining if you say “recall” when they prefer “sensitivity”, but that is how things are.
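To make the definitions concrete, here is a minimal Python sketch that evaluates several of them on one confusion matrix using exact fractions. The counts (A=30, B=10, C=20, D=40) are made up for illustration, and the helper name `metrics` is ours, not part of caret.

```python
from fractions import Fraction

def metrics(A, B, C, D):
    """A=TruePositives, B=FalsePositives, C=FalseNegatives, D=TrueNegatives."""
    A, B, C, D = (Fraction(x) for x in (A, B, C, D))
    total = A + B + C + D
    sensitivity = A / (A + C)
    specificity = D / (B + D)
    return {
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "Prevalence": (A + C) / total,
        "TPR": A / (A + C),
        "FPR": B / (B + D),
        "FNR": C / (A + C),
        "TNR": D / (B + D),
        "Recall": A / (A + C),
        "Precision": A / (A + B),
        "Accuracy": (A + D) / total,
        "BalancedAccuracy": (sensitivity + specificity) / 2,
    }

m = metrics(30, 10, 20, 40)  # hypothetical counts
print(m["Sensitivity"], m["TPR"], m["Recall"])  # 3/5 3/5 3/5
print(m["Precision"], m["Accuracy"])            # 3/4 7/10
```

Three of the names print the same value because they are the same formula.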

By declaring all of these quantities as sympy variables and expressions we can now check much more. We confirm formal equality of various measures by checking that their difference algebraically simplifies to zero.

# Confirm TPR == 1 - FNR
## [1] "0"
# Confirm Recall == Sensitivity
## [1] "0"
# Confirm PPV == Precision
## [1] "0"
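If you do not have a symbolic algebra system handy, a weaker but still useful check is to test the identities exactly on a grid of confusion matrices. The Python sketch below does that with exact rational arithmetic. It is a spot check over small positive counts, not a proof, and the function name `check_identities` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def check_identities(max_count=6):
    """Return the confusion matrices (A, B, C, D) on which an identity fails."""
    mismatches = []
    for counts in product(range(1, max_count + 1), repeat=4):
        A, B, C, D = (Fraction(x) for x in counts)
        sensitivity = A / (A + C)
        specificity = D / (B + D)
        prevalence = (A + C) / (A + B + C + D)
        # PPV written the way caret's help page writes it.
        ppv = (sensitivity * prevalence) / (
            (sensitivity * prevalence) + ((1 - specificity) * (1 - prevalence)))
        tpr = A / (A + C)
        fnr = C / (A + C)
        precision = A / (A + B)
        if tpr != 1 - fnr or ppv != precision:
            mismatches.append(counts)
    return mismatches

print(check_identities())  # []
```

An empty list means TPR == 1 - FNR and PPV == Precision held exactly on every matrix tried.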

We can also confirm non-identity by simplifying and checking an instance:

# Confirm Precision != Specificity
expr <- sympy(paste("simplify(",Precision-Specificity,")"))
## [1] "(FalsePositives*TruePositives - FalsePositives*TrueNegatives)/(FalsePositives*TrueNegatives + FalsePositives*TruePositives + TrueNegatives*TruePositives + FalsePositives**2)"
sub <- function(expr,
                TruePositives,FalsePositives,FalseNegatives,TrueNegatives) {

## [1] -0.5
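The same non-identity check can be sketched in plain Python. The instance below (TruePositives=0, FalsePositives=1, FalseNegatives=1, TrueNegatives=1) is our own choice: it happens to give -1/2, but the values used in the original run are not shown, so please do not read it as a reconstruction of that call.

```python
from fractions import Fraction

def precision_minus_specificity(tp, fp, fn, tn):
    """Evaluate Precision - Specificity directly from the definitions."""
    precision = Fraction(tp, tp + fp)
    specificity = Fraction(tn, fp + tn)
    return precision - specificity

def simplified_form(tp, fp, fn, tn):
    """Evaluate the simplified expression printed by sympy above."""
    return Fraction(fp * tp - fp * tn,
                    fp * tn + fp * tp + tn * tp + fp ** 2)

instance = (0, 1, 1, 1)  # hypothetical instance, not taken from the original
print(precision_minus_specificity(*instance))  # -1/2
print(simplified_form(*instance))              # -1/2
```

A single instance with a non-zero difference is enough to show the two measures are not the same function.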

More difficult checks

Balanced Accuracy

Write Prob[score(true)>score(false)] (with half point on ties) for the probability that a true (in-class) instance scores higher than a false (not in class) instance, with 1/2 point awarded for ties. We can then confirm Prob[score(true)>score(false)] (with half point on ties) == BalancedAccuracy for hard or decision classifiers by scoring each pairing of a true instance with a false instance as follows:

A D : True Positive and True Negative: Correct sorting 1 point
A B : True Positive and False Positive (same prediction "Positive", different outcomes): 1/2 point
C D : False Negative and True Negative (same prediction "Negative", different outcomes): 1/2 point
C B : False Negative and False Positive: Wrong order 0 points

Then ScoreTrueGTFalse == Prob[score(true)>score(false)] (with 1/2 point for ties) is:

ScoreTrueGTFalse = (1*A*D  + 0.5*A*B + 0.5*C*D + 0*C*B)/((A+C)*(B+D))

Which we can confirm is equal to balanced accuracy.

## [1] "0"
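The pair-counting argument can also be checked by brute force. In this minimal Python sketch a hard classifier's "score" is taken to be 1 for a positive decision and 0 for a negative one (our encoding, assumed for the example), and every true instance is compared against every false instance.

```python
from fractions import Fraction

def pairwise_score(A, B, C, D):
    """Prob[score(true) > score(false)] with half credit for ties."""
    true_scores = [1] * A + [0] * C    # true examples: A positives, C negatives
    false_scores = [1] * B + [0] * D   # false examples: B positives, D negatives
    points = Fraction(0)
    for t in true_scores:
        for f in false_scores:
            if t > f:
                points += 1               # correctly ordered pair
            elif t == f:
                points += Fraction(1, 2)  # tie
    return points / (len(true_scores) * len(false_scores))

def balanced_accuracy(A, B, C, D):
    return (Fraction(A, A + C) + Fraction(D, B + D)) / 2

counts = (3, 1, 2, 4)  # hypothetical A, B, C, D
print(pairwise_score(*counts), balanced_accuracy(*counts))  # 7/10 7/10
```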


We can also confirm Prob[score(true)>score(false)] (with half point on ties) == AUC. We can compute the AUC (the area under the drawn curve) of the above confusion matrix by referring to the following diagram.



Then we can check for general equality:

AUC = (1/2)*FPR*TPR + (1/2)*(1-FPR)*(1-TPR) + (1-FPR)*TPR
## [1] "0"

This AUC score (with half point credit on ties) equivalence holds in general (see also More on ROC/AUC, though I got this wrong the first time).
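For a hard classifier the ROC "curve" is just the path (0,0), (FPR,TPR), (1,1), so the area can be computed with the trapezoid rule and compared with the formula above. The following Python sketch does this with exact fractions. The counts are made up, and the helper names are ours.

```python
from fractions import Fraction

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = Fraction(0)
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

def single_point_auc(A, B, C, D):
    """AUC of the one-point ROC plot of a hard classifier."""
    tpr = Fraction(A, A + C)
    fpr = Fraction(B, B + D)
    return trapezoid_area([(0, 0), (fpr, tpr), (1, 1)])

def auc_formula(A, B, C, D):
    """The AUC expression from the text: two triangles plus a rectangle."""
    tpr = Fraction(A, A + C)
    fpr = Fraction(B, B + D)
    half = Fraction(1, 2)
    return half * fpr * tpr + half * (1 - fpr) * (1 - tpr) + (1 - fpr) * tpr

counts = (3, 1, 2, 4)  # hypothetical A, B, C, D
print(single_point_auc(*counts), auc_formula(*counts))  # 7/10 7/10
```

Both agree with the balanced accuracy (3/5 + 4/5)/2 = 7/10 for these counts.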


We can show F1 is different than Balanced Accuracy by plotting results they differ on:

# Wikipedia
F1 = 2*Precision*Recall/(Precision+Recall)
F1 = sympy(paste("simplify(",F1,")"))
## [1] "2*TruePositives/(FalseNegatives + FalsePositives + 2*TruePositives)"
## [1] "TrueNegatives/(2*(FalsePositives + TrueNegatives)) + TruePositives/(2*(FalseNegatives + TruePositives))"
# Show F1 and BalancedAccuracy do not always vary together (even for hard classifiers)
F1formula = parse(text=F1)
BAformula = parse(text=BalancedAccuracy)
frm = c()
for(TotTrue in 1:5) {
  for(TotFalse in 1:5) {
    for(TruePositives in 0:TotTrue) {
      for(TrueNegatives in 0:TotFalse) {
        FalsePositives = TotFalse-TrueNegatives
        FalseNegatives = TotTrue-TruePositives
        F1a <- sub(F1formula,
        BAa <- sub(BAformula,
        if((F1a<=0)&&(BAa>0.5)) {
        fi = data.frame(
          stringsAsFactors = FALSE)
        frm = rbind(frm,fi) # bad n^2 accumulation

ggplot(data=frm,aes(x=F1,y=BalancedAccuracy)) + 
  geom_point() + 
  ggtitle("F1 versus BalancedAccuracy/AUC")


F1 versus BalancedAccuracy/AUC
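One way to see the disagreement without a plot is to search for two hard classifiers, scored on the same data, that F1 and balanced accuracy rank in opposite orders. This Python sketch is our own illustration (the function `disagreements` and the 5 true / 5 false data set are assumptions, not the code behind the figure).

```python
from fractions import Fraction
from itertools import product

def f1(tp, fp, fn, tn):
    return Fraction(2 * tp, 2 * tp + fp + fn)

def balanced_accuracy(tp, fp, fn, tn):
    return (Fraction(tp, tp + fn) + Fraction(tn, fp + tn)) / 2

def disagreements(tot_true, tot_false):
    """Pairs (s, t) of confusion matrices where F1 prefers s but BA prefers t."""
    tables = [(tp, tot_false - tn, tot_true - tp, tn)
              for tp, tn in product(range(tot_true + 1), range(tot_false + 1))]
    return [(s, t) for s, t in product(tables, repeat=2)
            if f1(*s) > f1(*t) and balanced_accuracy(*s) < balanced_accuracy(*t)]

pairs = disagreements(5, 5)
all_positive = (5, 5, 0, 0)  # says "positive" for everything
cautious = (2, 0, 3, 5)      # few positives, no false alarms
print(f1(*all_positive), balanced_accuracy(*all_positive))  # 2/3 1/2
print(f1(*cautious), balanced_accuracy(*cautious))          # 4/7 7/10
print((all_positive, cautious) in pairs)                    # True
```

Here F1 prefers the classifier that calls everything positive, while balanced accuracy prefers the cautious one, so neither score is a monotone function of the other.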

Baroque measures

Over the years, various sciences have introduced more than 20 measures of “scoring correspondence” by playing games with publication priority, symmetry, and incorporating significance (“chance adjustments”) directly into the measure.

Each measure presumably exists because it avoids flaws of all of the others. However the sheer number of them (in my opinion) triggers what I call “De Morgan’s objection”:

If I had before me a fly and an elephant, having never seen more than one such magnitude of either kind; and if the fly were to endeavor to persuade me that he was larger than the elephant, I might by possibility be placed in a difficulty. The apparently little creature might use such arguments about the effect of distance, and might appeal to such laws of sight and hearing as I, if unlearned in those things, might be unable wholly to reject. But if there were a thousand flies, all buzzing, to appearance, about the great creature; and, to a fly, declaring, each one for himself, that he was bigger than the quadruped; and all giving different and frequently contradictory reasons; and each one despising and opposing the reasons of the others—I should feel quite at my ease. I should certainly say, My little friends, the case of each one of you is destroyed by the rest.

(Augustus De Morgan, “A Budget of Paradoxes” 1872)

There is actually an excellent literature stream investigating which of these measures are roughly equivalent (say arbitrary monotone functions of each other) and which are different (leave aside which are even useful).

Two excellent guides to this rat hole include:

  • Ackerman, M., & Ben-David, S. (2008). “Measures of clustering quality: A working set of axioms for clustering.” Advances in Neural Information Processing Systems: Proceedings of the 2008 Conference.

  • Warrens, M. (2008). “On similarity coefficients for 2×2 tables and correction for chance.” Psychometrika, 73(3), 487–502.

The point is: you not only can get a publication trying to sort this mess, you can actually do truly interesting work trying to relate these measures.

Further directions

One can take finding relations and invariants much further as in “Lectures on Algebraic Statistics” Mathias Drton, Bernd Sturmfels, Seth Sullivant, 2008.


It is a bit much to hope to only need to know “one best measure” or to claim to be familiar (let alone expert) in all plausible measures. Instead, find a few common evaluation measures that work well and stick with them.

Categories: Mathematics


Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

2 replies

    1. Thanks for the question Georg,

      What you are asking is really cutting to the central point. My response is long because it is an important point.

      Cohen’s kappa (the measure I assume you are referring to) is actually one of the many “inter-rater agreement” measures in the references I included (there is also a multi-rater version called Fleiss’ kappa).

      These measures were largely designed for measuring how well two raters agree with each other, assuming neither is the ground truth. I have seen them mostly in situations where we assign something like categorization of products many times to many external scorers (like Mechanical Turk) and we are trying to determine if two or more scorers are behaving in a consistent manner. Cohen’s kappa is essentially something as simple as correlation or accuracy adjusted for the rates at which the scorers are marking things positive. Changing the details of this changes the name of the measure (for instance, Wikipedia tells us Scott’s pi differs from Cohen’s kappa in how the chance rate is estimated). My point is these variations change the name of the score much faster than they change the actual utility of the score.

      What Cohen’s kappa does for an unbalanced class (predicting a rare event) is: estimate something like overall correlation or accuracy (which is easy to get a high score on for such a class- just say the event never happens!) and then tries to adjust the score for the fact that high accuracies are easy to achieve when we have unbalanced classes. This is likely most useful when scoring the same tagger across multiple data sets.
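To make the adjustment concrete, here is a minimal Python sketch of Cohen’s kappa computed from a 2 by 2 confusion matrix, using the standard form kappa = (p_observed - p_expected)/(1 - p_expected). The two confusion matrices are invented for illustration: a rare event (5 in 100) scored by a classifier that never says “positive”, and by one that finds most of the events at the cost of some false alarms.

```python
from fractions import Fraction

def accuracy(A, B, C, D):
    return Fraction(A + D, A + B + C + D)

def cohen_kappa(A, B, C, D):
    """A=TP, B=FP, C=FN, D=TN; agreement adjusted for chance agreement."""
    n = A + B + C + D
    observed = Fraction(A + D, n)
    # Chance agreement if predictions and truth were independent
    # with the same marginal rates.
    expected = Fraction((A + B) * (A + C) + (C + D) * (B + D), n * n)
    return (observed - expected) / (1 - expected)

never_positive = (0, 0, 5, 95)  # hypothetical: always says "negative"
finds_events = (4, 5, 1, 90)    # hypothetical: catches 4 of the 5 events
print(accuracy(*never_positive), cohen_kappa(*never_positive))  # 19/20 0
print(accuracy(*finds_events), cohen_kappa(*finds_events))      # 47/50 71/131
```

The do-nothing classifier wins on accuracy (0.95 versus 0.94) but gets a kappa of exactly zero, while the classifier that actually finds events gets a kappa of about 0.54.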

      In my opinion it is a fallacy to insist general classifier utility evaluations can be as simple as a single number or a total order prior to introducing additional problem domain details. How rare the target class is doesn’t actually tell you the relative cost of false positives and false negatives (which is necessary additional domain knowledge). So adjustments based only on sample size or outcome distribution can never be sufficient. To be sure, my intuition that there is no total order comes from the more detailed world of scoring classifiers or probability classifiers, which induce a much more detailed ROC plot than simple hard or decision based classifiers. But most current machine learning implementations tend to return such scores (neural nets, logistic regression, decision trees, random forests, gradient boosting, and even support vector machines).

      I would suggest at least reporting both of precision and recall or both of sensitivity and specificity. When I ran a research / data-science group at (now an eBay company) we usually reported precision and recall for each and every category in our catalog (around 200 primary categories). So we definitely felt different errors should have very different prices.

      Or (as a variation of what you mentioned): with your business partner put a price on counts on each cell of the confusion matrix and then pick the classifier that induces a confusion matrix maximizing price. This will be an interactive process as once the customer sees the consequences they will have feedback that may help them revise their price estimates. A similar thing can be done tracing along the ROC curve where (after putting population statistics back in) you can say for each point on the curve what confusion matrix would be derived and exactly what your inspection outcome would look like at each point.
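      A minimal Python sketch of this pricing exercise follows. Every number in it is hypothetical: the prices per cell, the two candidate classifiers, and their confusion matrices are invented only to show the mechanics.

```python
def matrix_value(counts, prices):
    """Total value of a confusion matrix: sum of price * count over the cells."""
    return sum(prices[cell] * counts[cell] for cell in prices)

def best_classifier(candidates, prices):
    """Name of the candidate whose confusion matrix has the highest value."""
    return max(candidates, key=lambda name: matrix_value(candidates[name], prices))

# Hypothetical confusion matrices for two classifiers on the same 1000 rows.
candidates = {
    "cautious":   {"TP": 20, "FP": 5,   "FN": 30, "TN": 945},
    "aggressive": {"TP": 45, "FP": 200, "FN": 5,  "TN": 750},
}

# First pricing from the business partner (hypothetical units).
prices = {"TP": 10, "FP": -2, "FN": -5, "TN": 0}
print(best_classifier(candidates, prices))   # cautious

# After seeing consequences, the partner decides misses are far more costly.
revised = {"TP": 10, "FP": -2, "FN": -20, "TN": 0}
print(best_classifier(candidates, revised))  # aggressive
```

      The same two confusion matrices lead to different choices under the two price lists, which is why the prices have to come from the business and not from the data alone.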

      So yes, I am in favor of cost sensitive classification. I just tend to make the cost adjustments after building the classifier, by working along the ROC curve. Methods such as re-balancing classes or stratified sampling can be critical to make standard methods run quickly or converge on very rare classes, but they are not always capable of completely re-shaping classifier performance to match business goals ( some notes here ).