
Who is allowed to call themselves a data scientist?

It has been popular to complain that the current terms “data science” and “big data” are so vague as to be meaningless. While these terms are quite high on the hype-cycle, even the American Statistical Association was forced to admit that data science is actually a real thing and exists.

Gartner hype cycle (Wikipedia).

Given we agree data science exists, who is allowed to call themselves a data scientist?

There is a school of thought that you cannot call yourself a data scientist unless you have mastered all of the following:

  • Statistical learning theory
  • High dimensional geometry
  • Optimization theory
  • Petabyte scale operations
  • Advanced programming
  • Combinatorics and algebra
  • Theoretical computer science
  • Measure theory
  • All of statistics
  • SQL
  • noSQL
  • Distributed System design

Many of these are topics covered in works such as Foundations of Data Science (John Hopcroft, Ravindran Kannan) and Mining of Massive Data Sets (Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman).

These are topics I know, and many of these authors are personal heroes:

  • John Hopcroft: One of the founders of modern design and analysis of algorithms. Coauthor of Introduction to Automata Theory, Languages, and Computation.
  • Ravindran Kannan: My advisor! Definitely brilliant.
  • Anand Rajaraman: CEO I had the honor of working for at, one of the inventors of Mechanical Turk, also brilliant.
  • Jeffrey David Ullman: One of the founders of modern design and analysis of algorithms. Coauthor of Introduction to Automata Theory, Languages, and Computation.

The theory is: only the unicorn who knows all of the above is to be allowed to call themselves a data scientist.

However, when Nina Zumel and I wrote Practical Data Science with R (Manning 2014) we took an opposite approach. We deliberately widened data science to:

a field that uses results from statistics, machine learning, and computer science to create predictive models.
Practical Data Science with R, “about this book”, page xix.

And here is why: outside of academia and some major labs the task of data science is essentially looking at client data and building useful predictive models.

This is good news. Statisticians know that prediction is fundamentally easier than inference (as prediction dodges many issues of causality). And most real world business clients have data at what we call “SQL scale” (fits in a nice database that can quickly run complicated SQL aggregations, not requiring a petabyte infrastructure). Clients tend to need automated decision procedures yielding high ROI (Return On Investment) to free up analysts for new problems.
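To make “SQL scale” a bit more concrete, here is a minimal sketch of the kind of aggregation we mean, using Python's built-in sqlite3 module with an in-memory database. The table name `orders`, its columns, the function `revenue_by_region`, and the sample rows are all made up for illustration; a real client database would of course be larger and messier.

```python
import sqlite3


def revenue_by_region(rows):
    """Aggregate hypothetical (region, amount) rows with a SQL GROUP BY."""
    # An in-memory database stands in for the client's "nice database".
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        # One query gives per-region order counts and totals.
        cur = conn.execute(
            "SELECT region, COUNT(*) AS n, SUM(amount) AS total "
            "FROM orders GROUP BY region ORDER BY region"
        )
        return cur.fetchall()
    finally:
        conn.close()


# Illustrative data only, not from any real client.
example_rows = [("east", 10.0), ("west", 5.0), ("east", 2.5)]
summary = revenue_by_region(example_rows)
print(summary)
```

The point of the sketch is only that the whole workflow fits on one machine and in one query, which is the situation most analysts actually face.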

And that brings us to the point of this essay. Because all of the analyst jobs have been re-classified as “data science” jobs, we have to allow analysts to call themselves “data scientists”.

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

3 replies

  1. Another point is: if we believe one group can claim exclusive priority on “data science” (statisticians slamming the door on computer scientists, or computer scientists slamming the door on statisticians) the group most likely to win is operations research.

    Here is a very clear description of what we would call a complete data science project in 1949: Magee, John. “Operations Research at Arthur D. Little, Inc.: The Early Years.” Operations Research, 2002. 50 (1), pp. 149-153. We took care to mention this example in Practical Data Science with R, as it is such a perfect example of the type of consultive answers data science customers want.

    Or it could end up with the actuaries winning, as they already have professional licensing worked out.