Can a classifier that never says “yes” be useful?

Many data science projects and presentations are needlessly derailed because shared, business-relevant quantitative expectations were not set early on (for some advice see Setting expectations in data science projects). One of the most common problems is the layman's expectation of “perfect prediction” from classification projects. It is important to set expectations correctly so your partners know what you are actually working towards and do not see a late choice of evaluation criteria as a disappointment or as “venue shopping.”

Let’s work with a simple but very common example. You are asked to build a classification engine for a rare event: say default in credit card accounts. In good times for well managed accounts it is easy to imagine the default rate per year could be well under 1%. In this situation you do not want to propose predicting which accounts will actually default in a given year. This may be what the client asks for, but it isn’t reasonable to presume this is always achievable. You need to talk the client out of a business process that requires perfect prediction and work with them to design a business process that works well with reasonable forecasting.

Why is such prediction hard? You usually have access to a lot of broad summary data for each account (net worth, age, family size, number of years the account has been active, patterns of borrowing, amount of health insurance, amount of life insurance, patterns of re-payment and so on). But you usually do not have access to many of the factors that trigger the default, and even when you do, such variables are not available very long before the event to be predicted. Trigger events for default can include sudden illness, physical accident, falling victim to a crime and other acute set-backs. The point is: two families without health insurance may have an equally elevated probability of credit default, but until you know which family gets sick you don’t know which one is much more likely to default.

Why does everybody ask for prediction? First: good prediction would be fantastic, if they could get it. Second: most laymen have no familiar notion of classifier quality other than accuracy (and measures similar to accuracy). And if all you know is accuracy then all you are prepared to discuss is prediction. So the client is really unlikely to ask you to optimize a metric they are unfamiliar with. The measures that help get you out of this rut are statistical deviance and information-theoretic entropy, so you will want to start hinting at these measures early on.

How do we show the value of achievable forecasting? For this discussion we define forecasting credit default as the calculation of a good conditional probability estimate of credit default. To evaluate forecasts we need measures beyond accuracy: measures that directly evaluate scores (without having to set a threshold to convert scores into predictions).
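
To make "measures that directly evaluate scores" concrete, here is a minimal sketch in Python. The ten-account toy data and both sets of scores are made up for illustration, and the function names are ours. It computes accuracy at the usual 0.5 threshold and the deviance (minus two times the log-likelihood, lower is better) for two forecasts that never cross the threshold.

```python
import math

def deviance(outcomes, scores):
    """-2 * log-likelihood of binary outcomes under the scores (lower is better)."""
    total = 0.0
    for y, p in zip(outcomes, scores):
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return -2.0 * total

def accuracy(outcomes, scores, threshold=0.5):
    """Fraction of accounts where the thresholded score matches the outcome."""
    hits = sum(1 for y, p in zip(outcomes, scores) if (p >= threshold) == (y == 1))
    return hits / len(outcomes)

# Hypothetical toy data: one default among ten accounts.
outcomes = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
flat_scores = [0.1] * 10            # every account gets the base rate
sharp_scores = [0.4] + [0.05] * 9   # the defaulting account gets an elevated score

acc_flat = accuracy(outcomes, flat_scores)
acc_sharp = accuracy(outcomes, sharp_scores)
dev_flat = deviance(outcomes, flat_scores)
dev_sharp = deviance(outcomes, sharp_scores)

print("accuracy:", acc_flat, acc_sharp)
print("deviance:", round(dev_flat, 3), round(dev_sharp, 3))
```

Both forecasts say “no” for every account, so accuracy cannot tell them apart. Deviance can: it rewards the forecast that puts a higher score on the account that actually defaulted.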

Back to our example. Suppose that in our population we expect 1% of the accounts to default. And we build a good forecast or scoring procedure that for 2% of the population returns a score of 0.3 and for the remaining 98% of the population returns a much lower score (about 0.004, which keeps the overall expected rate at 1%). Further suppose our scoring algorithm is well calibrated and excellent: the 2% of the population that receives a score of 0.3 or above actually tends to default at a rate of 30%.

Such a forecast identifies a 2% subset of the population that has a 30% chance of defaulting. Treated as a classifier it never says “yes” because it has not identified any examples that are estimated to have at least a 50% chance of defaulting (obviously we can force it to say “yes” by monkeying with scoring thresholds). So the classifier is not a silver bullet predictor. But it may (when backed with the right business process) be a fantastic forecaster: the subset it identifies is only 2% of the overall population yet has 60% of the expected defaults! Designing procedures to protect the lender from these accounts (insurance, cancellation, intervention, tighter limits, tighter payment schedule or even assistance) represents a potential opportunity to manage more than half of the lender’s expected defaults at minimal cost. To benefit, the client must both be able to sort or score accounts and have a business process that is not forced to treat all accounts as identical.
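
The arithmetic behind these figures can be checked in a few lines. This is a minimal sketch in Python using only the numbers stated in the example; the variable names are ours, and the default rate for the unflagged accounts is derived from the 1% overall rate rather than taken from the text.

```python
# Numbers from the example.
base_rate = 0.01             # overall expected default rate
flagged_fraction = 0.02      # share of accounts scored 0.3
flagged_default_rate = 0.30  # calibrated default rate among flagged accounts

# Expected defaults in the flagged group, as a fraction of all accounts.
flagged_defaults = flagged_fraction * flagged_default_rate

# Share of all expected defaults that fall in the flagged 2%.
share_captured = flagged_defaults / base_rate

# Lift: how much more often flagged accounts default than a typical account.
lift = flagged_default_rate / base_rate

# Derived (not stated in the text): default rate among the other 98%.
rest_default_rate = (base_rate - flagged_defaults) / (1 - flagged_fraction)

# Thresholded at 0.5 the model never says "yes", so it is right exactly
# when an account does not default: the same accuracy as having no model.
accuracy_never_yes = 1 - base_rate

print(f"share of defaults captured: {share_captured:.0%}")
print(f"lift in flagged group:      {lift:.0f}x")
print(f"default rate elsewhere:     {rest_default_rate:.2%}")
print(f"accuracy at 0.5 threshold:  {accuracy_never_yes:.0%}")
```

The last figure is the point: accuracy is 99% with or without the scores, while the share captured and the lift show what the scores are worth.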

As we have said: laymen tend to only be familiar with accuracy. And accuracy is not a good measure of forecasts (see: “I don’t think that means what you think it means;” Statistics to English Translation, Part 1: Accuracy Measures). What you need to do is shop through metrics before starting your project and find one that is good for your client. Finding a metric that is good for your client involves helping them specify how classifier information will be used (i.e. you have to help them design a business process). Some types of scores to try with your client include: lift, precision/recall, sensitivity/specificity, AUC, deviance, KL-divergence and log-likelihood.
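
As an illustration of how a few of these scores behave on the credit default example, here is a hedged sketch in Python. It works with expected values rather than simulated data, assumes the forecast is perfectly calibrated, and derives the default rate of the unflagged accounts from the 1% overall rate. The helper `expected_deviance` and all variable names are ours.

```python
import math

# Scenario numbers from the example; rest_rate is derived, not stated.
base_rate = 0.01
flagged_fraction = 0.02
flagged_rate = 0.30
rest_rate = (base_rate - flagged_fraction * flagged_rate) / (1 - flagged_fraction)

# Expected confusion-matrix cells (fractions of the population) if the
# threshold is lowered to 0.3, so "flagged" means "predicted to default".
tp = flagged_fraction * flagged_rate        # flagged and defaults
fp = flagged_fraction * (1 - flagged_rate)  # flagged, does not default
fn = base_rate - tp                         # defaults but was not flagged
tn = 1 - tp - fp - fn                       # not flagged, does not default

precision = tp / (tp + fp)
recall = tp / (tp + fn)        # also called sensitivity
specificity = tn / (tn + fp)

def expected_deviance(groups):
    """Expected per-account deviance; groups = [(fraction, true_rate, score)]."""
    total = 0.0
    for fraction, rate, score in groups:
        total += fraction * (rate * math.log(score)
                             + (1 - rate) * math.log(1 - score))
    return -2.0 * total

# Baseline: give every account the overall base rate as its score.
deviance_base = expected_deviance([(1.0, base_rate, base_rate)])
# Two-level forecast from the example.
deviance_forecast = expected_deviance([
    (flagged_fraction, flagged_rate, flagged_rate),
    (1 - flagged_fraction, rest_rate, rest_rate),
])

print(f"precision {precision:.2f}, recall {recall:.2f}, specificity {specificity:.3f}")
print(f"expected deviance: baseline {deviance_base:.4f}, forecast {deviance_forecast:.4f}")
```

Precision and recall depend on where the threshold is set, which is itself a business decision. The deviance comparison needs no threshold at all: the forecast scores better than the base-rate baseline simply because its probabilities are more informative.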

Time spent researching and discussing these metrics with your client is more valuable to the client than endless tweaking and tuning of a machine learning algorithm.

For more on designing projects around good data science project metrics please see Zumel, Mount, “Practical Data Science with R,” Chapter 5, “Choosing and Evaluating Models,” which discusses many of the above metrics.

Categories: Opinion, Pragmatic Data Science, Tutorials


Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

2 replies

  1. Wow. Another great post! A seductive aspect of these projects is that the analyst can always tweak a parameter to get different results. The point is, there is no statement of significance such as with classical hypothesis testing. In classical hypothesis testing, the researcher might pick a significance level of 0.05. If a result comes back with a p-value of 0.06, they’re not supposed to say “That’s really close to being significant.” The result either is or isn’t significant. They picked the 0.05 level before starting and are bound by the rigour of good research practice to not change it. The analog is: do not adjust parameters for a model until you get the confusion matrix you want.

    Question: For projects such as the one you describe, do you think we’ll get to a point where there is such rigour? For example, the analyst and business agree up front on a threshold for a test, stick to it, and if the tests don’t confirm a useful model, they stop and move on? “Move on” means working on business processes.

    I’d hate to put the brakes on discovery. Another seductive aspect of these projects is that the next mode or way of looking at the problem or data could be the breakthrough that truly adds business value.

  2. @Phillip Burger
    Thanks Phillip. All of this is probably more relevant to the “Drowning in Insignificance” article. But there are ideas to improve procedures (one of the other commenters gave a great reference). One thing you can do is insist on a clean run on new data after all the tweaks are done and see if things look like a good result or you are just seeing a reversion to mediocrity.