I am working on a promising new series of notes: common data science fallacies and pitfalls. (Probably still looking for a good name for the series!)
I thought I would share a few thoughts on it, and hopefully not jinx it too badly.
Data science is, for better or worse, an empirical field. Please consider the following partial definition of science from the Oxford English Dictionary:
A branch of knowledge conducted on objective principles involving the systematized observation of and experimentation with phenomena …
My point being: data science is a science. It is the science of studying the empirical phenomena of applying modeling methodologies. Data science isn’t the study of the areas it is applied to, or even of the development of the tools it uses.
Methodologies that empirically work are further used and studied. This means they tend to evolve or accrete into complex, but not fully correct, systems.
I think a lot of statisticians’ issue with data science is the following. Data science, while a producer and consumer of mathematical content, isn’t a mathematical field. Formal correctness isn’t a central tenet of data science. Empirical effectiveness is the central tenet.
Methods that tend to work are used, and reproduce through sharing.
I too find this disappointing. My background is mathematical. I still feel that time spent up front establishing theoretical correctness pays off in the long term.
So I am working on a small series on data science practices that are not quite correct. Where I can, I will identify correct alternatives. What I am trying to avoid is the accumulation of not-quite-correct patches and affordances that introduce more problems, which are then further patched, and so on. This cycle leads to some of the complexity of current machine learning infrastructure.
This, unfortunately, will not often lead to huge improvements. Any examined practice that was very wrong would, in an empirical field, get patched or removed.
This is like the old finance joke:
Two traders see a $100 bill on the street. One starts to bend to pick it up. The other stops him by saying: “Wait, if it were a real $100 somebody would have already picked it up by now.”
So we will go after smaller denomination bills.
We will show small improvements that can stack (many small improvements can be a large improvement), and cut down on complexity (using a correct method lessens the need for patches and the problems they then introduce).
Our first examples in this series are already out:
Some of this work is in collaboration with Dr. Nina Zumel and Professor Norm Matloff.
Many of these issues are “small nicks.” But one can die by a thousand cuts. So I want to help get rid of some of these issues, get rid of the work-arounds or patches used to deal with the problems, and get rid of the problems the patches introduce. Break the autocatalytic chain of faulty patches patching faulty patches.
For a lot of these issues I expect the following stages of reaction (possibly all from the same respondents):
- Nobody does that.
- You have no right to say that.
- We have always done that, as it isn’t a problem.
- There is no way to fix that.
- We already work around the problem in the following way.
The series will take work-arounds as evidence of a problem, but not as solutions. I want to emphasize methods that identify and avoid the core problems.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.