I am working on a promising new series of notes: common data science fallacies and pitfalls. (Probably still looking for a good name for the series!)
I thought I would share a few thoughts on it, and hopefully not jinx it too badly.
Data science is, for better or worse, an empirical field. Please consider the following partial definition of science from the Oxford English Dictionary:
A branch of knowledge conducted on objective principles involving the systematized observation of and experimentation with phenomena …
My point being: data science is a science. It is the science of studying the empirical phenomena of applying modeling methodologies. Data science isn’t the study of the areas it is applied to, or even of the development of the tools it uses.
Methodologies that empirically work are further used and studied. This means they tend to evolve or accrete into complex but not fully correct systems.
I think a lot of statisticians’ issue with data science is the following. Data science, while a producer and consumer of mathematical content, isn’t a mathematical field. Formal correctness isn’t a central tenet of data science. Empirical effectiveness is the central tenet.
Methods that tend to work are used, and reproduce through sharing.
I too find this disappointing. My background is mathematical. I still feel up-front time establishing theoretical correctness pays off in the long term.
So I am working on a small series on practices in data science that are not quite correct. Where I can, I will identify correct alternatives. What I am trying to avoid is the accumulation of not quite correct patches and affordances that introduce more problems, which are then further patched, and so on. This cycle leads to some of the complexity of current machine learning infrastructure.
This, unfortunately, will not often lead to huge improvements. Any examined practice that was very wrong would, in an empirical field, get patched or removed.
This is like the old finance joke:
Two traders see a $100 bill on the street. One starts to bend to pick it up. The other stops him by saying: “Wait, if it were a real $100 somebody would have already picked it up by now.”
So we will go after smaller denomination bills.
We will show small improvements that can stack (many small improvements can be a large improvement), and cut down on complexity (using a correct method lessens the need for patches and the problems they then introduce).
Our first examples in this series are already out:
Some of this work is in collaboration with Dr. Nina Zumel and Professor Norm Matloff.
Many of these issues are “small nicks.” But one can die by a thousand cuts. So I want to help get rid of some of these issues, get rid of the work-arounds or patches used to deal with the problems, and get rid of the problems the patches introduce. Break the autocatalytic chain of faulty patches patching faulty patches.
For a lot of these issues I expect the following stages of reaction (possibly all of them from some of the same respondents):
- Nobody does that.
- You have no right to say that.
- We have always done that, as it isn’t a problem.
- There is no way to fix that.
- We already work around the problem in the following way.
The series will take work-arounds as evidence of a problem, but not as solutions. I want to emphasize methods that identify and avoid the core problems.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
1. Do you consider Mathematics, Statistics or History to be Sciences? Why/why not?
2. If the object of Data Science is Modelling Methodologies and not Data, why isn’t it called Modelling Methodologies Science?
3. Isn’t the usual product of a Data Scientist’s work the best model for a given data set?
3.1. If so, isn’t the object of Data Science data (as the name implies) and model comparison its methodology?
3.2. If so, is Data Science a science if its methodology is not the experimental method?
4. If Data Science is about empirically finding out which modelling methodologies are better, shouldn’t the main objective of a Data Scientist’s work be the identification of conditions that affect model performance using the experimental method?
5. Are manuals and scientific papers the real product of a Data Scientist’s work, and is finding the best model for a set of data an applied Data Science job?
See: https://en.wikipedia.org/wiki/Demarcation_problem (it isn’t greatly developed, but a good place to start)
I’ll answer just a few. And just to start, I have studied the nature of the sciences beyond naive empiricism. In fact, I even took intro philosophy from Paul Feyerabend.
Thanks for the topic. I have been thinking about the question – “Where is the science in data science”. I am particularly interested in the effect of environmental variables on data collected during mechanical experiments – in particular, how the environment affects mechanical clocks. At the levels of interest, temperature, barometric pressure, and humidity have to be either controlled or compensated for. Correlation between those variables is not just statistics; there is a complicated formula that gives the relationship (psychrometric chart). You can find examples on the internet of temperature vs humidity that are just scatter plots of points with a regression line. If you just take the additional step of including the lines between the points, a very different slope is apparent – it depends on how much water is available. Similar problems exist with time delays due to the rate of change of the environmental variables.
So, yes, you can do multiple linear or non-linear regression using statistical methods in an attempt to understand system performance, but is there any way to include known variable sensitivity within a statistical model? I don’t seem to be able to find the issue discussed.
Editor – Horological Science Newsletter.
I always jokingly call the problems of the type you are referring to as “econometrics.” This is because econometricians often face this problem.
Roughly, you have a set of formulas, derived from theoretical knowledge, for how a mechanical clock will perform based on humidity, temperature, and other quantities. These formulas also have some unknown parameters for a given clock, so you might want to fit those parameters from observational data. You can do this with an optimizer, but it is often hard, as the data and parameters don’t relate in the nice ways that they do in a linear regression or logistic regression. To make matters worse, the model is often insensitive to certain parameters, meaning they can’t be fit from data. So to work this you need some sort of optimizer that tries many restarts and values, such as simulated annealing or Markov chain Monte Carlo.
It gets ugly fast. Machine learning and statistics pick functions that are easy to optimize (the exception being deep neural nets).
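To make the "many restarts" idea concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the model form `a * exp(b * temp) + c * humidity`, the parameter values, and the synthetic data are invented, and are not a real clock formula. The local search is a plain greedy random-perturbation search, a simple stand-in for annealing or MCMC, not an implementation of either.

```python
import math
import random

def model(params, temp, humidity):
    """Hypothetical rate model (an assumption): a * exp(b * temp) + c * humidity."""
    a, b, c = params
    return a * math.exp(b * temp) + c * humidity

def loss(params, data):
    """Sum of squared residuals between the model and the observations."""
    return sum((model(params, t, h) - y) ** 2 for t, h, y in data)

def local_search(start, data, rng, steps=1000, scale=0.1):
    """Greedy random-perturbation search from a single starting point."""
    best, best_loss = list(start), loss(start, data)
    for _ in range(steps):
        cand = [p + rng.gauss(0.0, scale) for p in best]
        cand_loss = loss(cand, data)
        if cand_loss < best_loss:  # keep only improving moves
            best, best_loss = cand, cand_loss
    return best, best_loss

def multi_start_fit(data, rng, n_starts=10):
    """Run the local search from several random starts; return best and all runs."""
    runs = []
    for _ in range(n_starts):
        start = [rng.uniform(-2.0, 2.0) for _ in range(3)]
        runs.append(local_search(start, data, rng))
    return min(runs, key=lambda r: r[1]), runs

# Synthetic observations from made-up "true" parameters, with small noise.
rng = random.Random(0)
true_params = (1.0, 0.8, -0.5)
data = []
for _ in range(30):
    t, h = rng.random(), rng.random()  # temperature and humidity rescaled to [0, 1]
    data.append((t, h, model(true_params, t, h) + rng.gauss(0.0, 0.01)))

(best_params, best_loss), all_runs = multi_start_fit(data, rng)
print("best parameters:", best_params)
print("best loss:", best_loss)
print("losses by restart:", sorted(run_loss for _, run_loss in all_runs))
```

Comparing the restarts is the useful part. If several restarts reach a similar loss with noticeably different parameter values, that is a hint the data do not pin those parameters down, which is the insensitivity problem mentioned above.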