I want to talk about a misconception about the difference between inference and prediction. For a well-run, analytically oriented business, there may not be as many reasons to prefer inference over prediction as one may have heard.
A common refrain is: data scientists err in centering so much on prediction, a mistake no true Scotsman statistician would make.
I’ve come to question this more and more. Mere differences in practice between two fields don’t immediately imply either field is inferior or in error. Differences can be evidence that we are looking at two different fields, and not two names for the same field.
In this note we will explore the possibility that businesses may not always extract much extra value in insisting on inference. This is to say: industrial data scientists may not be wrong in concentrating on prediction, especially if they track results or, better yet, perform controlled experiments (both of which are in fact common business analytics practices).
This inference/prediction distinction is stated somewhat differently in different venues. For example:
- Inference: Use the model to learn about the data generation process.
- Prediction: Use the model to predict the outcomes for new data points.
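The two bullets above can be illustrated with a single fitted model used both ways. A minimal sketch, using simulated data and ordinary least squares via numpy (all data and coefficient values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data from a linear data generating process with
# true coefficients (2.0, -1.0) and a small amount of noise.
x = rng.normal(size=(200, 2))
y = x @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200)

# Fit ordinary least squares (with an intercept column).
beta_hat, *_ = np.linalg.lstsq(np.c_[np.ones(200), x], y, rcond=None)

# Inference: examine the estimated coefficients to learn about
# the data generation process itself.
print("estimated coefficients:", beta_hat[1:])

# Prediction: score a new data point. Note the coefficients
# themselves never need to be examined for this use.
x_new = np.array([1.0, 0.5, -0.5])
print("prediction:", x_new @ beta_hat)
```

The same object (the fitted model) answers both questions; the distinction is in what we do with it.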
Much of the discussion probably links back to Breiman, Leo. "Statistical Modeling: The Two Cultures" (with comments and a rejoinder by the author). Statist. Sci. 16 (2001), no. 3, 199–231. doi:10.1214/ss/1009213726. https://projecteuclid.org/euclid.ss/1009213726. And the topic is written about often. For example:
Predictions: The outputs emitted by a model of a data generating process in response to a specific configuration of inputs.
Inferences: The information learned about the data generating process through the systematic comparison of predictions from the model to observed data from the data generating process.
Roughly, “the data generating process” is code for assuming the proposed model, often a model with coefficients, is structurally correct. Under this rubric, correct estimation of the coefficients of the unobserved ideal population from the sample is what is often meant by inference. This is compatible with the usual definition of statistical inference:
Inference: The process of drawing conclusions about a population on the basis of measurements or observations made on a sample of individuals from the population.
B. S. Everitt; The Cambridge Dictionary of Statistics, 2nd Edition, Cambridge University Press, 2005.
The distinction often gets coarsened down in conversation to “data scientists care only about prediction, and statisticians care only about the coefficients.” This, in turn, is often used to criticize “black box models” such as ensembles of decision trees, deep learning, support vector machines, and so on as being models without small sets of parameters that can be considered to be interpretable coefficients.
Now let’s look at modeling from a purely utilitarian, or purely business, perspective. In my opinion, this changes things quite a bit. We can use statistics to criticize the business practices of data science, or the business practices of data science to criticize statistics. Let’s set up one example and work the criticism forward and back.
Our in-industry data scientist builds a predictive model that predicts the probability of a web visitor making a purchase as a function of some explanatory variables. They then take this model and apply it to several different web traffic sources they are willing to buy web traffic from (essentially places they are willing to place advertisements). They use these aggregated predictions to recommend increased purchases from some of the sources, or perhaps even purchases from new sources.
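A sketch of the scoring step just described. Everything here is hypothetical: the model coefficients are made up, and the per-source feature distributions are simulated rather than fitted from real traffic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical already-fitted model: purchase probability as a
# logistic function of two explanatory variables.
beta = np.array([1.2, -0.8])
intercept = -2.0

def predict_purchase_prob(x):
    """Score rows of explanatory variables with the fitted model."""
    return 1.0 / (1.0 + np.exp(-(x @ beta + intercept)))

# Simulated visitor features from three traffic sources, each with
# its own distribution of explanatory variable values.
sources = {
    "source_a": rng.normal([1.0, 0.0], 0.5, size=(500, 2)),
    "source_b": rng.normal([0.5, 0.5], 0.5, size=(500, 2)),
    "source_c": rng.normal([0.0, 1.0], 0.5, size=(500, 2)),
}

# Aggregate predictions per source: the basis for the purchase
# recommendation described in the text.
value_per_source = {
    name: float(predict_purchase_prob(x).mean())
    for name, x in sources.items()
}
ranked = sorted(value_per_source, key=value_per_source.get, reverse=True)
print(ranked)  # buy more from the sources at the front of this list
```

Note the recommendation is driven entirely by aggregated predictions; no coefficient of a "source value" model is ever estimated or examined, which is exactly what the statistical critique below picks up on.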
- A possible statistical critique. The model was used without first studying the model’s behavior with respect both to the explanatory variables and possible omitted variables (including possibly the traffic source ids). Because you are using the model to propose a change in purchase as a function of the traffic source IDs, you have essentially exposed the implied coefficients of a model using only the traffic source IDs. This is without ever performing any investigation of, or diagnostics on, this implied model.
- A business critique. The business question of “does it make sense to increase or decrease the purchases from a given source” is a fundamentally different question than asking if the inference of the relation between traffic source and expected customer value is correct.
One reason for this is: the purchase can fail even under correct inference or with a perfect model. For example: it is common, when increasing volume from a traffic source, to receive lower quality traffic with a different distribution of values of the explanatory variables. The effect is common, as the traffic sources often model purchase intent (using information such as ad copy or search terms) and send traffic in a highest-quality-first order. Larger purchases run further down the order, to lower quality matches.
It is common practice to follow up a proposal or change with a controlled experiment: buying more traffic from apparently more valuable sources, and checking whether the distribution of explanatory variables remains favorable and whether the predicted higher value also holds. If not a full controlled experiment, the much-maligned dashboards are used to track drift and decay.
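The drift tracking mentioned above can be as simple as comparing the distribution of an explanatory variable before and after the volume increase. One common dashboard statistic is the population stability index (PSI); a minimal sketch on simulated data (the before/after distributions here are invented to mimic the quality-order effect described above):

```python
import numpy as np

rng = np.random.default_rng(1)

def psi(expected, actual, bins=10):
    """Population stability index between two samples of one variable.
    Larger values mean larger distribution drift (a common rule of
    thumb: < 0.1 stable, 0.1-0.25 moderate, > 0.25 major shift)."""
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    lo, hi = cuts[0], cuts[-1]
    # Clip both samples into the cut range so no mass falls outside.
    e = np.histogram(np.clip(expected, lo, hi), cuts)[0] / len(expected)
    a = np.histogram(np.clip(actual, lo, hi), cuts)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

# Hypothetical: an explanatory variable before the larger purchase...
baseline = rng.normal(1.0, 0.5, size=2000)
# ...and after, when the source runs further down its quality order.
after_increase = rng.normal(0.6, 0.7, size=2000)

print("PSI after volume increase:", psi(baseline, after_increase))
```

If the PSI (or the tracked outcome rates) moves adversely, the purchase recommendation gets revisited, which is the feedback step the business critique relies on.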
I’ll continue to take the data science side of the argument (or drop the fiction that I didn’t just write both ends of the dialogue/argument) and end with the following.
Obviously finding mistakes earlier and cheaper is a benefit. Data scientists do look for diagnostics that give them early warnings.
I think criticizing the quick use of predictions may be based on a misconception that “predict and use” is the entire business experimental process. Prediction isn’t the entire experiment; controlled testing and tracking are routine follow-ups and part of the larger business analytics feedback cycle.
I am claiming a good reactive business loop works with the following:
The information learned about the environment through the systematic comparison of predictions from the model to observed data from controlled experiments.
To my mind this is related to my earlier note claiming data science is an empirical study of methodologies. Things that work in practice are used more often. Things that are truly disastrous are eventually selected out.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.