There’s a common, yet easy to fix, mistake that I often see in machine learning and data science projects and teaching: using classification rules for classification problems.
This statement is a bit of word-play which I will need to unpack. However, the concrete advice is that you often get better results using models that return a continuous score for classification problems. You should make that numeric score available to downstream business logic instead of making a class choice at model prediction time. Using the word “classifier” informally to mean “scoring procedure for classes” is not that harmful. Losing a numeric score is harmful.
A classification problem is a situation where for each example or instance one has observable features (explanatory variables) and wants to predict for each instance membership in a given class from a finite set of classes.
The entry for discriminant analysis in Everitt’s The Cambridge Dictionary of Statistics 2nd Edition defines “classification rule” in terms of patients (examples) and groups (classes):
This [the discriminant analysis procedure] results in a classification rule (often also known as an allocation rule) that may be used to assign a new patient to one of the two groups.
“Classifier” doesn’t have a direct entry in The Cambridge Dictionary of Statistics, but is most commonly taken to be a synonym for “classification rule.”
Many classification rules are realized by checking if a numeric scoring procedure meets a given threshold. And there is a bit of a cottage industry in shaming those who use the term “classifier” to describe such a numeric scoring procedure prior to committing to a fixed decision threshold. To avoid this one can either:
- Use a classification rule when building a model to predict classes. This has the advantage that if you happen to use the term “classifier” you are technically correct. It has the disadvantage that you lose a lot of the utility of your model.
- Use a numeric scoring procedure when building a model to predict classes. This has the advantage that you get to pick your decision thresholds or policies later, and can even change or adjust them. It has the disadvantage that if you say you are using a “classifier” you are wrong, in the sense that you used your dinner fork on your salad course.
Many data scientists go for the first option. This can cause enormous problems, especially with the type of unbalanced-prevalence tasks we commonly attack: predicting clicks, predicting illness, predicting purchases, and predicting account cancellation. In all these cases the prevalence of the class we are trying to predict is usually nowhere near half of the training examples, and the default thresholds in classification rules tend to serve us poorly.
The distinction between classifications and scores is easy to describe in terms of scikit-learn: model.predict(X) is often implemented as model.predict_proba(X)[:, 1] >= threshold, with threshold defaulting to 0.5.
.predict_proba() has the advantage of giving you good control of the threshold, allowing model metrics such as AUC and deviance, and more.
Or, to abuse Clemenceau:
Thresholds are too important to be left to the defaults.
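To make the point concrete, here is a minimal sketch in plain Python. The scores and labels are invented for illustration (they stand in for the positive-class column a probability model would return on an unbalanced task), and the helper names classify and recall are my own, not part of any library.

```python
# Hypothetical scores, standing in for the positive-class probabilities
# of a model on a task where only 3 of 12 examples are positive.
scores = [0.02, 0.05, 0.03, 0.30, 0.08, 0.45, 0.04, 0.12, 0.35, 0.06, 0.10, 0.01]
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]


def classify(scores, threshold=0.5):
    """Turn numeric scores into 0/1 class decisions at a chosen threshold."""
    return [int(s >= threshold) for s in scores]


def recall(labels, decisions):
    """Fraction of the true positives that the decisions recover."""
    hits = sum(1 for y, d in zip(labels, decisions) if y == 1 and d == 1)
    return hits / sum(labels)


# A fixed classification rule: the threshold is baked in at prediction time.
default_decisions = classify(scores)

# Keeping the scores lets us choose (and later change) the threshold.
tuned_decisions = classify(scores, threshold=0.2)

print("default threshold:", sum(default_decisions), "flagged, recall", recall(labels, default_decisions))
print("threshold 0.2:    ", sum(tuned_decisions), "flagged, recall", recall(labels, tuned_decisions))
```

In this toy data no score reaches the default threshold, so the fixed rule flags nothing, while the same scores separate the classes perfectly at a lower threshold. Had only the class decisions been kept, that information would be gone.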
One may object: “if this is so important why don’t more people talk about this problem?”
People do in fact talk about the problem of classification rules in the presence of unbalanced data; they just talk about it in the context of the commonly taught mitigations: up-sampling and down-sampling. Up-sampling is the process of duplicating data from the rare class so that it appears to be more common in the training set. This can solve some problems, but introduces many others (lack of independence between the training data rows, and an inability to safely test/train split data or use cross-validation techniques without working around the up-sampling). Down-sampling is the process of removing examples from the common class so the rare class again appears to be more prevalent in the training set. This is statistically inefficient.
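The test/train split hazard of up-sampling can be seen in a small simulation. This is only a sketch: the row ids, the duplication factor of 10, the 70/30 split, and the seed are all arbitrary assumptions chosen for illustration.

```python
import random

random.seed(0)  # arbitrary seed, used only so the run is repeatable

# Hypothetical data: row ids 0..19 are the rare class, 20..219 the common class.
rare = list(range(20))
common = list(range(20, 220))

# Up-sample: duplicate every rare row 10 times so the classes look balanced.
upsampled = rare * 10 + common

# Naive random 70/30 split performed AFTER up-sampling.
random.shuffle(upsampled)
cut = len(upsampled) * 7 // 10
train, test = upsampled[:cut], upsampled[cut:]

# Only duplicated (rare) rows can land on both sides of the split.
leaked = set(train) & set(test)
print(len(leaked), "of", len(rare), "rare rows have copies in both train and test")
```

Any row counted here is scored at test time after an exact copy of it was seen in training, so the held-out estimate is no longer an honest one. Avoiding this means splitting first and up-sampling only the training portion, which is the kind of working around mentioned above.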
My advice is:
- Prefer models that return probabilities. That is, call .predict_proba(), and not .predict().
- Do not use up-sampling or down-sampling unless you are sure you need to. Often moving the decision threshold is easier and will work just as well if not better than up-sampling or down-sampling.
- Do not use accuracy as your development performance measure; instead use AUC (for non-probability scores) or deviance/cross-entropy (for probabilities).
- For business policies (or classification procedures) work with your business partners to find a threshold that represents a good precision/recall or sensitivity/specificity trade-off. What is a good trade-off is application specific. Often one wants to later change a threshold in production.
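As a rough sketch of the last two points, the following plain-Python code computes AUC and mean deviance from held-out probabilities, then tabulates precision and recall at a few candidate thresholds. The labels and probabilities are made up for illustration, and the function names are my own. Deviance is taken here as -2 times the log-likelihood, so the mean deviance is twice the mean cross-entropy.

```python
import math

# Hypothetical held-out labels and predicted probabilities (illustration only).
labels = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
probs = [0.05, 0.10, 0.40, 0.20, 0.02, 0.70, 0.30, 0.08, 0.25, 0.15]


def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_deviance(labels, probs):
    """Mean binomial deviance: -2 times the mean log-likelihood."""
    log_lik = sum(math.log(p if y == 1 else 1.0 - p) for y, p in zip(labels, probs))
    return -2.0 * log_lik / len(labels)


def precision_recall(labels, probs, threshold):
    """Precision and recall of the rule `prob >= threshold`."""
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    flagged = sum(1 for p in probs if p >= threshold)
    precision = tp / flagged if flagged else float("nan")
    return precision, tp / sum(labels)


print("AUC:", round(auc(labels, probs), 3))
print("mean deviance:", round(mean_deviance(labels, probs), 3))
for t in (0.5, 0.3, 0.2):
    precision, rec = precision_recall(labels, probs, t)
    print("threshold", t, "precision", round(precision, 2), "recall", round(rec, 2))
```

AUC and deviance are computed without committing to any threshold, which is what makes them usable during model development. The precision/recall table is the kind of artifact to bring to business partners when choosing the operating threshold.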
I hope to follow up on this note with a technical analysis of the effect of prevalence (distribution of classes) on my favorite method to apply to classification problems: logistic regression. I’ll probably even call logistic regression “a classifier.”
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.