“Essentially, all models are wrong, but some are useful.”
Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.
We’ve discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it’s better than the models that you rejected?
Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.
In this latest “Statistics as it should be” series, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner. Meaning we are going to consider scoring system utility in terms of service to a negotiable business goal (one of the many ways data science differs from pure machine learning).
To organize the ideas into digestible chunks, we are presenting this article as a four part series (to finished in the next 3 Tuesdays). This part (part 1) sets up the specific problem.
Our example problem
Let’s use a single example to make things concrete. We have used the 2009 KDD Cup dataset to demonstrate estimating variable significance, so we will use it again here to demonstrate model evaluation. The contest task was supervised machine learning. The goal was to build scores that predict things like churn (account cancellation) from a data set consisting of about 50,000 rows (representing credit card accounts) and 234 variables (both numeric and categorical facts about the accounts). An IBM group won the contest with an AUC (“area under the curve”) of 0.76 in predicting churn on held-out data. Using R we can get an AUC of 0.71 on our own hold-out set (meaning we used less data for training) using automated variable preparation, standard gradient boosting, and essentially no parameter tuning (which itself can be automated as it is in packages such as caret).
Obviously a 0.71 AUC would not win the contest. But remember the difference between 0.76 and 0.71 may or may not be statistically significant (something we will touch on in this article) and may or may not make a business difference. Typically a business combines a score with a single threshold to convert it into an operating classifier or decision procedure. The threshold is chosen as a business driven compromise between domain driven precision and recall (or sensitivity and specificity) goals. Businesses do not directly experience AUC which summarizes facts about the classifiers the score would induce at many different threshold levels (including ones that are irrelevant to the business). A scoring system whose ROC curve contains another scoring system’s ROC curve is definitely the better classifier, but small increases in AUC don’t always ensure such containment. AUC is an acceptable proxy score when choosing among classifiers (however, it does not have a non-strained reasonable probabilistic interpretation, despite such claims), and it should not be your final business metric.
For this article, however- we will stick with the score evaluation measures: deviance and AUC. But keep in mind that in an actual data science project you are much more likely to quickly get a reliable 0.05 increase in AUC by working with your business partners to transform, clean, or find more variables- than by tuning your post-data- collection machine learning procedure. So we feel score tuning is already over-emphasized and don’t want to dwell too much more on it here.
Choice of utility metric
One way a data science project differs from a machine learning contest is that the choice of score or utility metric is an important choice made by the data scientist, and not a choice supplied by a competition framework. The metric or score must map to utility for the business client. The business goal in supervised machine learning project is usually either classification (picking a group of accounts at higher risk of churn) or sorting (ordering accounts by predicted risk).
Choice of experimental design, data preparation, and choice of metric can be a big driver of project success or failure. For example in hazard models (such as predicting churn) the items that are easiest to score are items that have essentially already happened. You may have call-center code that encodes “called to cancel” as one of your predictive signals. Technically it is a great signal, the person certainly hasn’t cancelled prior to the end of the call. But it is useless to the business. The data-scientist has to help re-design the problem definition and data curation to focus in on customers that are going to cancel soon, but to indicate some reasonable time before they cancel (see here for more on the issue). The business goal is to change the problem to a more useful business problem that may induce a harder machine learning problem. The business goal is not to do as well as possible on a single unchanging machine learning problem.
If the business needs a decision procedure: then part of the project is picking a threshold that converts the scoring system into a classifier. To do this you need some sort of business sensitive pricing of true-positives, false-positives, true-negatives, and false-negatives or working out appropriate trade-offs between precision and recall. While tuning scoring procedures we suggest using one of deviance or AUC as a proxy measure until you are ready to try converting your score into a classifier. Deviance has the advantage that it has nice interpretations in terms of log-likelihood and entropy, and AUC has the advantage that is invariant under any one-to-one monotone transformation of your score.
A classifier is best evaluated with precision and recall or sensitivity and specificity. Order evaluation is best done with an AUC-like score such as the Gini coefficient or even a gain curve.
A note on accuracy
In most applications the cost of false-positives (accounts the classifier thinks will churn, but do not) is usually very different than the cost of false-negatives (accounts the classifier things will not churn, but do). This means a measure that prices these two errors identically is almost never the right final utility score. Accuracy is exactly one such measure. You must understand most business partners ask for “accurate” classifiers only because it may be the only term they are familiar with. Take the time to discuss appropriate utility measures with your business partners.
Here is an example to really drive the point home. The KDD2009 data set had a churn rate of around 7%. Consider the following two classifiers. Classifier A that predicts “churn” on 21% of the data but captures all of the churners in its positive predictions. Classifier B that predicts “no churn” on all data. Classifier A is wrong 14% of the time and thus has an accuracy of 86%. Classifier B is wrong 7% of the time and thus has an accuracy of 93% and is the more accurate classifier. Classifier A is a “home run” in a business sense (it has recall 1.0 and precision 33%!), Classifier B is absolutely useless. See here, for more discussion on this issue.
In all cases we are going to pick a utility score or statistic. We want to estimate the utility of our model on future data (as our model will hopefully be used on new data in the future). The performance of our model in the future is usually an unknowable quantity. However, we can try to estimate this unknowable quantity by an appeal to the idea of exchangeability. If we had a set of test data that was exchangeable with the unknown future data, then an estimate of our utility on this test set should be a good estimate of future behavior. A similar way to get at this is if future data were independent and identically distributed with the test data then we again could expect to make an estimate.
The issues we run into in designing an estimate of model utility include at least the following:
- Are we attempting to evaluate an actual score or the procedure for building scores? These are two related, but different questions.
- Are we deriving a single point estimate or a distribution of estimates? Are we estimating sizes of effects, significances, or both?
- Are we using data that was involved in the training procedure (which breaks exchangeability!) or fresh data?
Your answers to these questions determine what procedures you should try.
We are going to work through a good number of the available testing and validation procedures. There is no “one true” procedure, so you need to get used to having more than one method to choose from. We suggest you go over each of the upcoming graphs with a ruler and see what conclusions you can draw about the relative utility of each of the models we are demonstrating.
The no-measure procedure is the following: pick a good machine learning procedure, use it to fit the data, and turn that in as your solution. In principle nobody is ever so ill mannered to do this.
However, if you only try one modeling technique and don’t base any decision on your measure or score- how does that differ from having made no measurement? Suppose we (as in this R example) only made one try of Random Forest on the KDD2009 problem? We could present our boss with a ROC graph like the following:
Because we only tried one model the only thing our boss can look for is the AUC above 0.5 (uselessness) or not. They have no idea if 0.67 is large or small. Since or AUC measure drove no decision, it essentially was no measurement.
So at the very least we need to set a sense of scale. We should at least try more than one model.
Model supplied diagnostics
If we are going to try more than one model, we run into the problem that each model reports different diagnostics. Random forest tends to report error rates, logistic regression reports deviance, GBM reports variable importance. At this point you find you need to standardize on your own quality of score measure and run your own (or library code) on all models.
Now that we have framed the problem, we will continue this series with:
- Part 2: In-training set measures
- Part 3: Out of sample procedures
- Part 4: Cross-validation techniques
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.