Many data scientists (and even statisticians) often suffer under one of the following misapprehensions:
- They believe a technique doesn’t work in their current situation (when in fact it does), leading to useless precautions and missed opportunities.
- They believe a technique does work in their current situation (when in fact it does not), leading to failed experiments or incorrect results.
I feel this happens less often if you are working with observable and composable tools of the proper scale. Somewhere between monolithic all-in-one systems and ad-hoc one-off coding is a cognitive sweet spot where great work can be done.
For a whole system to be correct, our modeling problem and data must match all the assumptions of all of the pieces simultaneously. In a monolithic system, if one component can’t handle collinear variables or imbalanced classes, then the whole system cannot work in such cases. In a modular system one can change the objective function (say, introduce regularization) and the optimizer independently of the model specification to fix such issues.
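As a rough illustration of that modularity, here is a minimal pure-Python sketch. The names make_logistic_objective and gradient_descent are hypothetical helpers invented for this example, not functions from any package, and the toy data (with a deliberately duplicated column) is made up. The objective knows about the model and the optional L2 penalty; the optimizer only ever sees a gradient function.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def make_logistic_objective(X, y, l2=0.0):
    """Hypothetical helper: return (loss, grad) closures for the mean
    logistic loss plus an optional L2 penalty of 0.5 * l2 * ||w||^2."""
    n = len(X)

    def loss(w):
        total = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
            total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
        return total / n + 0.5 * l2 * sum(wj * wj for wj in w)

    def grad(w):
        g = [l2 * wj for wj in w]  # gradient of the penalty term
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                g[j] += err * xj / n
        return g

    return loss, grad

def gradient_descent(grad, w0, step=0.1, iters=5000):
    """Hypothetical generic optimizer: it knows nothing about the model."""
    w = list(w0)
    for _ in range(iters):
        g = grad(w)
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return w

# Toy data: an intercept column plus one variable entered twice (collinear).
X = [[1.0, x, x] for x in (0.0, 1.0, 2.0, 3.0, 4.0, 5.0)]
y = [0, 0, 1, 0, 1, 1]
w0 = [0.0, 0.0, 0.0]

# Same model, same optimizer; only the objective changes.
plain_loss, plain_grad = make_logistic_objective(X, y)
ridge_loss, ridge_grad = make_logistic_objective(X, y, l2=0.1)

w_plain = gradient_descent(plain_grad, w0)
w_ridge = gradient_descent(ridge_grad, w0)
```

Adding the penalty did not require touching the optimizer, and any other routine that accepts a gradient function could stand in for gradient_descent without touching the objective.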
We also do not want the opposite situation of an explosion of unmaintained hand-rolled ad-hoc or “from scratch” systems. In this case projects keep repeating (like the movie Groundhog Day) and you are continually using re-invented, un-documented, and untested wheels.
We should prefer re-usable components of appropriate scope and scale that we can inspect, substitute, and re-compose as we like. Some examples:
- Model optimality conditions should be separated from the model optimizer/solver/fitter, so that we can substitute in more powerful optimizers. Many models (such as logistic regression) have beautiful mathematical structure, but often come locked in with their own optimizer (such as Newton’s method or iteratively re-weighted least squares in the case of logistic regression). This prevents us from swapping in other optimizers and extending such models to allow regularization or coefficient non-negativity constraints. It also prevents us from easily changing to optimizers that may be appropriate for large data (such as stochastic gradient methods).
- Model evaluation and fit stability/significance evaluation should not be locked into the solver. In R the glm fitters return detailed summaries including fit quality, (sometimes) model significance, and variable significance. It would actually be much better if we used a standard method to evaluate all of these so we would have the same set of tests as we switched modeling methods, or could switch to tests that are more appropriate to our problem domain. This is why in our book (Practical Data Science with R) we spend significant time showing how to re-compute these summaries from the predictions alone, after modeling. We have productionized and packaged many of the methods in the R package sigr. Systems like LIME (also available in R) represent attempts to generate explanations and variable importances independent of the model.
- Variable preparation and treatment should be packaged (as we document and promote in the R vtreat package). Automating the important mechanical steps (dealing with missing values and high cardinality categorical variables) leaves more time for important domain specific data inspection and preparation steps. Also using well documented tools for these steps can greatly shorten your own paper’s methods section.
- Visualization should be packaged in re-usable teachable graphical schemes (as we promote in the R WVPlots package). All visualizations have to be taught before they are understood, so it makes sense to start re-using them.
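To make the point about evaluating a model from its predictions alone a bit more concrete, here is a small pure-Python sketch. It is not the sigr API; deviance and pseudo_r_squared are hypothetical helper names, and the outcomes and predicted probabilities are made-up numbers. The only inputs are the observed outcomes and the predicted probabilities, so the same check applies no matter which fitter produced them.

```python
import math

def deviance(y, p, eps=1e-12):
    """Binomial deviance: -2 times the log-likelihood of the 0/1 outcomes y
    under the predicted probabilities p."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)  # guard against log(0)
        total += yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return -2.0 * total

def pseudo_r_squared(y, p):
    """Fraction of the null deviance explained by the predictions.
    The null model predicts the observed prevalence of y for every row."""
    rate = sum(y) / len(y)
    null_dev = deviance(y, [rate] * len(y))
    return 1.0 - deviance(y, p) / null_dev

# Made-up outcomes and predictions from two hypothetical models.
y = [0, 0, 1, 0, 1, 1]
model_a = [0.1, 0.2, 0.6, 0.4, 0.8, 0.9]  # informative predictions
model_b = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # no better than the prevalence

r2_a = pseudo_r_squared(y, model_a)
r2_b = pseudo_r_squared(y, model_b)
```

Because nothing here inspects the model object, the two sets of predictions could have come from entirely different modeling methods and still be compared on the same footing.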
Basically we want to be playing with a moderate number of powerful blocks.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.