# The Science of Data Analysis

I am re-reading the great statistician John W. Tukey’s paper:

Tukey, John W. “The Future of Data Analysis.” Ann. Math. Statist. 33 (1962), no. 1, pp. 1–67. `doi:10.1214/aoms/1177704711`. https://projecteuclid.org/euclid.aoms/1177704711

I’ve taken the liberty of pulling out some quotes that are very relevant to the usual “data science is not a science” snobbery. I find the “data scientists are carpetbaggers taking food out of the mouths of statisticians” or “data science is just statistics done wrong” attitudes to be off. The tasks data scientists attempt are much closer to the type of consulting that is most successfully taught and done as operations research (some of my notes on this can be found here), not the content of formal statistics.

Let’s cherry-pick a few quotes from the paper.

Data analysis is a larger and more varied field than inference, …

[op. cit., pg. 2.]

In the above, I would take “inference” to mean the art of statistics.

There are diverse views as to what makes a science, but three constituents will be judged essential by most, viz:

• (a1) intellectual content,
• (a2) organization into an understandable form,
• (a3) reliance upon the test of experience as the ultimate standard of validity.

By these tests, mathematics is not a science, since its ultimate standard of validity is an agreed-upon sort of logical consistency and provability.

As I see it, data analysis passes all three tests, and I would regard it as a science, one defined by a ubiquitous problem rather than by a concrete subject.

[op. cit., pp. 5-6.]

So I would say complaints about the wild empirical nature of data science are likely just discomfort with empiricism in general. Data science is a science, and science isn’t always as pretty as the idealized court plays of the atelier or academy.

Keep in mind: even poor empiricism is vastly preferable to mind-poisoning arcana (the “alternative”). There are more definitions of science than mere operationalism, or even positivism (Paul Feyerabend’s criticisms in particular). However, the above points are fairly relevant to what science aspires to be.

And we now get to the core of the argument.

Finally, we need to give up the vain hope that data analysis can be founded upon a logico-deductive system like Euclidean plane geometry (or some form of the propositional calculus) and face up to the fact that data analysis is intrinsically an empirical science.

[op. cit., pg. 63.]

And Tukey even comments on the seeming lack of coherence in the science of data analysis.

The future of data analysis can involve great progress, the overcoming of real difficulties, and the provision of a great service to all fields of science and technology.

[op. cit., pg. 64.]

Roughly, we see that data analysis has diverse practices precisely because it is applied in diverse situations.

Probably I’ll start tagging some of our notes with “The Science of Data Analysis.”

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.