For data science projects I recommend using source control or version control, and committing changes at a very fine level of granularity. This means checking in possibly broken code with possibly weak commit messages (so when working on a shared project, you may want a private branch or a second source control repository).
Please read on for our justification.
The issue we are facing is Chesterton’s Fence:
In the matter of reforming things, as distinct from deforming them, there is one plain and simple principle; a principle which will probably be called a paradox. There exists in such a case a certain institution or law; let us say, for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, “I don’t see the use of this; let us clear it away.” To which the more intelligent type of reformer will do well to answer: “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.”
How this appears in software or data science projects is often: “harmless cleanup” steps break your project, and you don’t detect this until much later.
The Chesterton’s Fence parable always amused me, as it doesn’t include an actual example of adverse consequences (though I always mis-remember it as having one). Nobody who does actual work is in fact careful enough or knowledgeable enough to always avoid removing Chesterton’s fence as a matter of foresight. However, in hindsight you often can see the problem. Luckily, version control is a time machine that translates common hindsight into more valuable foresight: you can travel back to before a mistake, armed with knowledge of the consequences of making it.
So, let’s add a minor data science example.
I’ve recently been playing around with a Keras/TensorFlow project, which I will probably write up later. At some point I “cleaned up” the code by replacing an unsightly tensor slice of the form x[:, (j-1):j] with the more natural looking indexing x[:, j-1]. What I neglected is that TensorFlow uses the tensor rank/shape details to record the difference between a single data column and a data frame containing a single data column (a small distinction that is very important to maintain in data science projects). This “cleanup” broke the code in a non-signaling way, as additional TensorFlow re-shaping rules allowed the calculation to move forward with incorrect values. A few changes later I re-ran the project evaluation, and the model performance fell precipitously. I had no idea why a model that had recently performed well now didn’t work.
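The shape distinction is easy to demonstrate with NumPy arrays, which follow the same slicing rules (a small illustrative sketch, not the original project code):

```python
import numpy as np

# A stand-in matrix: 4 rows, 3 columns (not the original project's tensor).
x = np.arange(12).reshape(4, 3)
j = 2

# Slice indexing preserves the column dimension: a 4x1 matrix.
as_matrix = x[:, (j - 1):j]

# Integer indexing drops the dimension: a length-4 vector.
as_vector = x[:, j - 1]

print(as_matrix.shape)  # (4, 1)
print(as_vector.shape)  # (4,)
```

Both forms select the same values, but at different ranks; downstream code expecting rank-2 input may silently broadcast the vector instead of raising an error.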
The saving grace was that I had committed at very fine granularity, even during the “harmless code clean-up,” using git version control. These were exactly the commits you would be embarrassed to share, and these “useless” commits saved me: I could quickly bisection search for the poison commit. The concept is illustrated in chapter 11 of Practical Data Science with R (please check it out!).
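The git bisect command automates exactly this search: given a known-good and a known-bad commit, it binary searches the history for the first bad one. Here is a minimal Python sketch of the underlying idea, with a hypothetical is_good() check standing in for re-running the project evaluation:

```python
def first_bad(versions, is_good):
    """Return the index of the first 'bad' version, assuming every
    good version precedes every bad one (the bisect precondition)."""
    lo, hi = 0, len(versions)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(versions[mid]):
            lo = mid + 1  # the poison commit is after mid
        else:
            hi = mid      # mid is bad; the first bad commit is at or before mid
    return lo

# Hypothetical history: commits 0..99, with the bad "cleanup" landing at commit 57.
history = list(range(100))
poison = first_bad(history, is_good=lambda v: v < 57)
print(poison)  # 57 -- found in about 7 evaluations instead of up to 100
```

Each evaluation halves the suspect range, so even a long run of small, embarrassing commits costs only a logarithmic number of re-tests to search.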
Now git is a bit of a “when you walk with it you need fear no other” protector. In the process of finding the breaking change I accidentally checked out the repository at a given version (instead of a specific file), causing the dreaded git “detached HEAD” state in my source control repository. But the win was that this was a common, researchable problem with known fixes. I was happy to trade my “why did this stop working for no reason” mystery for the routine maintenance task of repairing the repository after finding the root cause.
And that is the nature of source control or version control: it is a bunch of technical considerations that end up being a net positive, as they can save you from worse issues.
After note: a much worse, and more memorable, parable on the value of source control is the following. I remember a master’s degree candidate in mathematics at UC Berkeley losing an entire draft of her dissertation when she accidentally typed “rm * .log” instead of “rm *.log” to clean up side-effect files in her working directory. The extra space allowed the remove command to match and nuke important files. Without source control, this set her back a month.
For a nice lecture on the inevitability of errors (and thus why we need to mitigate them, as they cannot be fully eliminated) I recommend The Lead Developer’s “Who Destroyed Three Mile Island” presentation.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.