
On Being a Data Scientist

When people ask me what it means to be a data scientist, I used to answer, “it means you don’t have to hold my hand.” By which I meant that as a data scientist (a consulting data scientist), I can handle the data collection, the data cleaning and wrangling, the analysis, and the final presentation of results (both technical and for the business audience) with a minimal amount of assistance from my clients or their people. Not no assistance, of course, but little enough that I’m not interfering too much with their day-to-day job.

This used to be a key selling point, because people with all the necessary skills used to be relatively rare. This is less true now; data science is a hot new career track. Training courses and academic tracks are popping up all over the place. So there is the question: what should such courses teach? Or more to the heart of the question — what does a data scientist do, and what do they need to know?

Hilary Mason and Chris Wiggins took a crack at answering that a couple of years ago. They break down data science, the process, into 5 steps:

  • Obtain the data: in their case from Web APIs.
  • Scrub the data: Look for missing data, bad data, outliers. Regularize text data (for instance, locations: is “CA” California, or Canada? What about “Cal.”, “Ca”, “California”, “San Francisco”, etc.?).
  • Explore. And visualize. Here and during the scrub step is where I might start thinking about the best representations of the data for modeling. Here is where I begin variable selection.
  • Model. And evaluate. This is where the statistics and machine learning knowledge comes in.
  • Interpret. And disseminate.
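The location example in the scrub step above can be sketched in a few lines. This is a toy illustration (the alias table, the `normalize_location` function, and the rule of flagging ambiguous values for human review are my own assumptions, not Mason and Wiggins’s code):

```python
# Toy sketch of the "scrub" step: regularizing free-text location fields.
# The alias table and ambiguity rule are illustrative assumptions.
AMBIGUOUS = {"ca"}  # "CA" (or "Ca") could be California or Canada; flag it
ALIASES = {
    "cal.": "California",
    "california": "California",
    "san francisco": "California",
}

def normalize_location(raw):
    """Return a canonical location, or None if the value needs human review."""
    key = raw.strip().lower()
    if key in AMBIGUOUS:
        return None          # cannot be resolved automatically
    return ALIASES.get(key)  # also None for unrecognized values

print(normalize_location("Cal."))  # California
print(normalize_location("CA"))    # None: ambiguous, needs a human
```

In practice this table grows case by case as exploration turns up new spellings, which is one reason the scrub and explore steps loop into each other.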

That’s basically the breakdown most of us would give. Their focus is on web analysis and on data collection over the web; generally my focus has been on clients who already have the data (albeit in some completely cryptic form); it’s our job to wring some insight from it — somehow. So in addition to the emphasis that Mason and Wiggins place on scripting languages and Unix tools, I would also add knowledge of SQL, and a tool like R that can access data directly from the database for analysis. My colleague John Mount would also add that version control is a must. As with software engineering, data science is a process where “that one last tweak” to the model or to the data handling can turn out to be a tweak too many…
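The SQL-plus-analysis-tool workflow might look like this minimal sketch. I’ve used Python’s built-in sqlite3 with a hypothetical `accounts` table purely for illustration; the post mentions R, which works the same way through database-access packages:

```python
import sqlite3

# Minimal sketch: query the database directly from the analysis environment.
# The in-memory database and the `accounts` table are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (region TEXT, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("west", 120.0), ("west", 80.0), ("east", 50.0)])

# Let SQL do the first pass of aggregation, then continue analysis on the result.
rows = conn.execute(
    "SELECT region, AVG(balance) FROM accounts GROUP BY region ORDER BY region"
).fetchall()
for region, avg_balance in rows:
    print(region, avg_balance)
```

The design point is simply that the aggregation happens where the data lives, instead of exporting spreadsheets back and forth.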

Beyond the tools, and the technical details, though — what would I add?

I would add that the process is a loop; more than that, it’s loops within loops. Obtain-Scrub-Explore is often one loop. Scrub (Represent)-Explore-Model can be another loop. It always depends.

I would add that a healthy understanding of the business processes that generate the data is essential — otherwise you are apt to “discover” things in the data that everyone (that is, your client) already knows, because they are known artifacts of the business process. The insights in data are like degrees of freedom. Don’t eat up your degrees of freedom on known phenomena. If you don’t have that domain knowledge yourself, make sure your client partners you with a contact who does.

I would add that a solid understanding of statistics fundamentals is essential (and the whole Win-Vector blog attests to how much time we spend thinking about fundamentals), but stat and machine learning are not the core of the job. The real science, in my opinion — the part where you form hypotheses, test them, revise them — comes less in the modeling and more in the scrub and explore steps. Why does this branch of the bank report recoveries where they never reported losses? What is that “profit” column reporting, really? Does gross national product really predict mortgage defaults, or is it just a proxy variable for time (and in the recent economy, time predicts mortgage default rate pretty well)?

And there is more science after the modeling, during the evaluation phase. Or more prosaically: the debug phase. Why does the model report absolute nonsense on this one subset of the data? Is the error in the modeling? The data handling? The programming? The “modeling” step itself is actually a very small, and relatively straightforward, part of the overall process.

No one ever wants to hear this. We all come into the job hoping to wield support vector machines or neural nets like Wonder Woman wields her magic lasso: we capture the data, and then wrest the truth out of it, willy-nilly. I wish. I’d love to wear those cool bullet-deflecting bracelets, too.

[Image] Art: Alex Ross

My point here is that answering the original question is more a discussion of process than a checklist of skills to have and technologies to be familiar with.

What would you add? What do you think a data scientist needs to know?

This was originally posted at; reblogged here.

Categories: Opinion


Nina Zumel

Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.

4 replies

  1. I generally end up feeling like I’m a data getter and cleaner about half the time. I’m hoping to discover new ways to shorten this cycle as it is the least fun thing to do. Good software practices and some kind of “agile” development loop are generally important. Nobody teaches that in school. Managing expectations and communicating limitations is huge; fortunately, I have a chaotic personal life, which helps.
    SVM/neurons/KNN/Forests or whatever are generally interchangeable, and not as useful as linear models, filters or rigorous hypothesis testing/detection theory/whatever you call it. Oh yeah, variable selection techniques are hugely important, IMO. Some ML techniques can deal with extraneous variables, but you’re much better off without them.

    I must say, I thought this was a John Mount post at first, and wondered if he was “going through some changes” wrt the photo. ;-)

  2. @Justin R
    The Spiral Model is a good metaphor for what I’m talking about, I think. The primary point is that the process is not a straightforward waterfall; there is a lot of looping back to previous stages to, e.g., get better data because the data you have is inadequate or too dirty, or whatever. And often you discover that you haven’t been asking the right question in the first place, so back you go…

    This is common sense, perhaps, but it really isn’t how most people present the data science process, and I think it’s worth calling out.

  3. @Scott Locklin
    Yeah, I’m not crazy about the data getting/cleaning step either, though the visualization and representation phases might be my favorite part. Unfortunately, scrape and scrub are part of the job, and if they weren’t, we might not have a job….

    Agreed on the agile practices and good software engineering hygiene. Managing expectations is definitely a big part of the job — thanks for pointing that out. Communication with the client, in general, as you say. Agreed that the “basic” techniques, like regression, are often the most suitable.

    And if I can channel some of that John Mount crankiness, that’s probably a good thing :)
    I probably look better than he does in cuff bracelets and high-heeled boots, anyway…
