I strongly advise using version control, and usually recommend using git as your version control system. Usually I feel a bit guilty about this advice as git is so general that it is more of a toolkit for a version control system than a complete prescriptive version control system (the missing pieces being the selection and documentation of a workflow and conventions).
But I still feel git is the one to use. My requirements involve not writing dot files in every single directory (breaks some OSX tools, and both CVS and Subversion do this), being able to work disconnected (eliminates Perforce), being cross-platform, being actively maintained, and being able to easily change decisions such as where the gold standard repository lives (or even changing your mind on collaborating or not). This makes me lean towards BZR, git and Mercurial. Git is the most popular one of the bunch and has the most popular repository aggregator: GitHub.
For beginners I teach treating git like old-school RCS or SCCS: just use git to maintain versions of your local files. Don’t worry about using it to share or distribute files (but do make sure to back up your directory in some way). To use git in this way you only need to run three commands regularly: “git status,” “git add,” and “git commit” (see Minimal Version Control Lesson: Use It). Roughly, status shows you what is going on and add/commit pairs checkpoint your work. To work in this way you don’t need to know anything about branching (version control nerds’ favorite confusing topic), merging and so on. The idea is that as long as you are running add/commit pairs often enough any other problem you run into can be solved (though it may take an hour of searching books and Stack Overflow to find the answer). Git’s user interface is horrible (in part) because “everything is possible,” but that also means you can (with difficulty) solve just about any problem you run into with git (except, it seems, nested or dependent repositories).
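A minimal sketch of that three-command loop, run in a throwaway directory. The temporary directory, the file name notes.txt, the commit messages, and the demo identity set through environment variables are all placeholders chosen for illustration.

```shell
#!/bin/bash
# Throwaway demo: the status / add / commit loop on local files only.
set -e
# Placeholder identity so commits work on a machine with no git configuration.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)            # hypothetical project directory
cd "$work"
git init -q .

echo "first draft" > notes.txt
git status --short           # prints "?? notes.txt": an untracked file
git add notes.txt            # stage the file
git commit -q -m "checkpoint: first draft"

echo "second draft" > notes.txt
git status --short           # prints " M notes.txt": a modified file
git add notes.txt
git commit -q -m "checkpoint: second draft"

checkpoints=$(git rev-list --count HEAD)   # number of commits recorded so far
echo "checkpoints recorded: $checkpoints"
```

Each add/commit pair is one checkpoint, so the count at the end is simply the number of times the pair was run.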
However, eventually you want to work with a collaborator or distribute your results to a client. To do that effectively with git you need to start using additional commands such as “git pull,” “git rebase,” and “git push.” Things seem more confusing at this point (though you still do not yet need to worry about branching in its full generality), but are in fact far less confusing and far less error prone than ad-hoc solutions such as emailing zip files. I almost always advise sharing work in a “star workflow” where each worker has their own repository and a single common “naked” repository (what git calls a “bare” repository: one with only git data structures and no ready-to-use files) is used to coordinate (thought of as a server or gold standard, often named “origin”). This is treating git as if it were just a better CVS or SVN (the difference being if you want to perform a truly distributed step like pushing code to a collaborator without using the main server, you can and git will actually help with the record keeping). The central repository can be GitHub, GitLab or even a directory on a machine with ssh access. A lot of ink is spilled on how such a workflow doesn’t feel like a “distributed workflow,” but it is (you can work when disconnected from the central repository, and if the central repository is lost any up-to-date worker can provision a new central repository).
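As a rough sketch of the star workflow, the script below provisions a bare hub from one worker, lets a second worker join, and then simulates losing the hub and rebuilding it from an up-to-date worker. The directory names (worker1, worker2, hub.git, hub2.git) and the demo identity are hypothetical, and everything lives in a temporary directory rather than on a real server.

```shell
#!/bin/bash
# Throwaway demo: one bare hub, two workers, and recovery from a lost hub.
set -e
# Placeholder identity so commits work on a machine with no git configuration.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

top=$(mktemp -d)
cd "$top"

# A worker starts a project locally.
git init -q worker1
cd worker1
git symbolic-ref HEAD refs/heads/master   # pin the branch name for this demo
echo "practice repo" > README.txt
git add README.txt
git commit -q -m "first commit"
cd ..

# Provision the hub: a bare repository holding only git data structures.
git clone -q --bare worker1 hub.git
git -C worker1 remote add origin ../hub.git
git -C worker1 push -q origin master

# A second worker joins by cloning the hub.
git clone -q hub.git worker2

# Simulate losing the hub, then rebuild it from an up-to-date worker.
rm -rf hub.git
git clone -q --bare worker2 hub2.git
git -C worker1 remote set-url origin ../hub2.git
git -C worker1 fetch -q origin

hub_head=$(git --git-dir=hub2.git rev-parse master)
worker_head=$(git -C worker1 rev-parse master)
echo "hub: $hub_head  worker1: $worker_head"
```

The rebuilt hub and the surviving worker report the same commit, which is the sense in which the star workflow is still distributed.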
To get familiar with git I recommend a good book such as Jon Loeliger and Matthew McCullough’s “Version Control with Git” 2nd Edition, O’Reilly 2012. Or, better yet, work with people who know git. In all cases you need to keep notes: git issues are often solved by sequences of three to five esoteric commands. Even after working with git for some time I still run into major “hair pullers.” One of these major “hair pullers” is what I call “pseudo conflicts,” which is what I am going to describe in this article.
When working with collaborators using git I eventually run into what I call a pseudo conflict. Both of us have been working in our own repositories and all of a sudden one of us runs into difficulty syncing the central repository to our personal repository. Usually this happens in a panic (we are syncing for a reason, often a deadline), but it is worth working through in detail. The details of the problem and the solution are both sufficiently complicated that I tend to not gracefully resolve these issues without keeping notes (which I have expanded into this article).
The usual shared workflow is:
- Work, work, work.
- Very often: commit results to the local repository using a “git add” “git commit” pair.
- Every once in a while: Pull a copy of the remote repository into our view with some variation of “git pull” and then use “git push” to push our work upstream.
Usually for simple tasks we don’t use branches and we use the “rebase” option on pull so that it appears that every piece of work is recorded in a simple linear order, even though collaborators are actually working in parallel. This is what I call an essential difficulty of working with others: time and order become separate ideas and become hard to track (this is not a needless complexity added by using git; git does add some needless complexities, but this is not one of them).
Typically two authors may be working on different files in the same project at the same time. As we see in the figure below the second author to try to push their results to the shared repository must decide how to specify the parallel work was performed. Either they can say the work was truly in parallel (represented by two branches being formed and then a merge record joining the work) or they can rebase their own work to claim their work was done “after” the other’s work (preserving a linear edit history and avoiding the need for any merge records). Note: “before” and “after” are tracked in terms of arrows, not time.
The idea is: merging is what is really happening, but a rebased history is much simpler to read. The general rule is you should only rebase work you have not yet shared (in our above example B should feel free to rebase their edits to appear to be after A’s edits as they have not yet successfully pushed their work anywhere). You should avoid rebasing records people have seen, as you are essentially hiding the edit steps they may be basing their work on (forcing them to merge or rebase in the future to catch up with your changed record keeping).
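The parallel-work case can be tried out in a sandbox. The sketch below assumes two workers, repoA and repoB, editing different files against a bare hub.git; the file names fA.txt and fB.txt, the commit messages, and the demo identity are placeholders. B is second to push, so B pulls with rebase and the shared history stays linear.

```shell
#!/bin/bash
# Throwaway demo: parallel work on different files, kept linear by pull --rebase.
set -e
# Placeholder identity so commits work on a machine with no git configuration.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

top=$(mktemp -d)
cd "$top"

# Shared starting point: repoA, a bare hub, and repoB cloned from the hub.
git init -q repoA
cd repoA
git symbolic-ref HEAD refs/heads/master   # pin the branch name for this demo
echo "practice repo" > README.txt
git add -A . && git commit -q -m "first commit"
cd ..
git clone -q --bare repoA hub.git
git -C repoA remote add origin ../hub.git
git clone -q hub.git repoB

# A works and pushes first.
cd repoA
echo "content A" > fA.txt
git add -A . && git commit -q -m "A work"
git push -q origin master

# B works in parallel on a different file, without pulling from A.
cd ../repoB
echo "content B" > fB.txt
git add -A . && git commit -q -m "B work"

# B has not shared this commit yet, so B may rebase it to come "after" A's.
git pull -q --rebase origin master
git push -q origin master

merges=$(git rev-list --merges --count HEAD)   # merge records in the history
total=$(git rev-list --count HEAD)             # all commits in the history
git log --oneline --graph
```

Replacing the rebase pull with a plain merge pull would instead record the split and a merge commit, which is the other choice described above.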
For some projects I try to use a rebase-only strategy. For example: our upcoming book has only two authors and we are only trying to create one final copy of the book (we are not trying to maintain older versions for other uses). If we always rebase, the edit history will appear totally ordered (for each pair of edits one is always recorded as having come before the other) and this makes talking about versions of the book much easier (again: “before” is determined by arrows in the edit history, not by time stamp). But even with only two authors in close communication we sometimes run into merge conflicts, and some of these conflicts are what I am calling “pseudo conflicts”. Pseudo conflicts are conflicts where, if you look only at the final content, there is no conflict. That is: any conflict in your history is in the extra record keeping you are performing to preserve a history (and not in the final content itself).
The most common pseudo-conflict we run into is: both authors perform the same edit on a file. One way we think this happens is we have a complex application (like OmniGraffle or Keynote) open and some state or lock files accidentally get committed in one state and then re-committed in the original state. This could be fixed with more rigorous .gitignore files, but it can happen.
Normally git resolves transient conflicts automatically: when the second author rebases their work to come after the first author’s, the second author’s (now redundant) edit is thrown away and there is no conflict (though the second author’s commit message is not shown in the revised history). The “gotcha” is when the second author has performed more than one commit before trying to pull/push and some of the intermediate commits do in fact generate conflicting content. And that is the issue: git wants all of the history to be consistent, not just the part of the history you are currently paying attention to.
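The benign case, where a single redundant commit is quietly dropped, can be checked in a sandbox. This sketch assumes the same repoA / repoB / hub.git layout used in this article; the commit message “B same edit” and the demo identity are placeholders. Both authors add fA.txt with identical content, and B then pulls with rebase.

```shell
#!/bin/bash
# Throwaway demo: an identical single-commit edit is dropped on rebase.
set -e
# Placeholder identity so commits work on a machine with no git configuration.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

top=$(mktemp -d)
cd "$top"

git init -q repoA
cd repoA
git symbolic-ref HEAD refs/heads/master   # pin the branch name for this demo
echo "practice repo" > README.txt
git add -A . && git commit -q -m "first commit"
cd ..
git clone -q --bare repoA hub.git
git -C repoA remote add origin ../hub.git
git clone -q hub.git repoB

# A makes the edit and pushes it.
cd repoA
echo "content A1" > fA.txt
git add -A . && git commit -q -m "A work add"
git push -q origin master

# B makes the very same edit in one commit, without pulling from A.
cd ../repoB
echo "content A1" > fA.txt
git add -A . && git commit -q -m "B same edit"

# Rebase notices B's change is already upstream and skips it.
git pull -q --rebase origin master

total=$(git rev-list --count HEAD)
top_subject=$(git log -1 --format=%s)
echo "commits: $total, newest: $top_subject"
```

Only A’s commit survives in B’s view, so B’s commit message is gone from the revised history, as noted above.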
A simple example is the first author makes a change to a given file and the second author makes two committed changes to the same file such that the end result is identical to the first author’s change. The end results are consistent, but one of the intermediate results is not (which causes a merge conflict). This is easier to illustrate with a concrete example. The bash script gitPseudoConflictExample.bash sets up two repositories (repoA and repoB) plus a shared “server” repository (hub.git) and prepares the pseudo conflict we have just described. The code to set this up is below:
    #!/bin/bash

    # clean out old examples
    rm -rf hub.git repoA repoB

    # create repoA with initial content
    mkdir repoA
    pushd repoA
    git init .
    echo "practice repo" > README.txt
    git add -A .
    git commit -m "first commit"
    popd

    # create central sharing hub
    git clone --bare repoA hub.git

    # push repoA to central repo
    pushd repoA
    git remote add origin ../hub.git
    git push origin master
    popd

    # create repoB
    git clone hub.git repoB

    # simulate some work in A
    pushd repoA
    echo "content A1" > fA.txt
    git add -A .
    git commit -m "A work add"
    git push origin master
    popd

    # simulate parallel work in B that ends in parallel state (without pulling from A)
    pushd repoB
    echo "content B1" > fA.txt
    git add -A .
    git commit -m "B work add"
    echo "content A1" > fA.txt
    git add -A .
    git commit -m "B work edit"
    popd
Here we are exploiting one of the awesome features of git: it is cheap to create repositories and even “servers” (in this case just the directory “hub.git”). So if you ever have a question “I wonder what happens in this corner case” you can set up an example and actually see what happens. At this point in our worked example, if B tries to pull (with rebase) results into their repository with the following commands they fail with the message “CONFLICT (add/add): Merge conflict in fA.txt.”

    # try to pull/push (and fail)
    cd repoB
    git pull --rebase origin master
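To look at the failure more closely, the following sketch rebuilds the same pseudo conflict in a temporary directory, lets the rebase pull stop, and then lists the files git considers unmerged before aborting. The temporary directory and the demo identity are placeholders; the repository and file names follow the setup script above.

```shell
#!/bin/bash
# Throwaway demo: reproduce the pseudo conflict and inspect it before fixing.
set -e
# Placeholder identity so commits work on a machine with no git configuration.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

top=$(mktemp -d)
cd "$top"

git init -q repoA
cd repoA
git symbolic-ref HEAD refs/heads/master   # pin the branch name for this demo
echo "practice repo" > README.txt
git add -A . && git commit -q -m "first commit"
cd ..
git clone -q --bare repoA hub.git
git -C repoA remote add origin ../hub.git
git clone -q hub.git repoB

cd repoA
echo "content A1" > fA.txt
git add -A . && git commit -q -m "A work add"
git push -q origin master

cd ../repoB
echo "content B1" > fA.txt
git add -A . && git commit -q -m "B work add"
echo "content A1" > fA.txt
git add -A . && git commit -q -m "B work edit"

# The rebase pull is expected to stop on the first of B's commits.
if git pull -q --rebase origin master >/dev/null 2>&1; then
  echo "unexpected: pull succeeded"
fi

conflicted=$(git diff --name-only --diff-filter=U)   # files with unresolved merges
markers=$(grep -c '^<<<<<<<' fA.txt || true)         # lines opening a conflict block
echo "unmerged files: $conflicted (conflict blocks: $markers)"
git status --short                                   # unmerged paths are flagged here

git rebase --abort                                   # back to B's two original commits
after_abort=$(git rev-list --count HEAD)
```

After the abort, repoB is back to its pre-pull state, so either of the fixes in this article can be tried from a clean start.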
At this point we see the conflict and can not go on. There are two main ways to fix this. The first is to abort the rebase and then admit the source branched during the simultaneous edits and merge it:
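    git rebase --abort
    git fetch
    # merge rather than rebase; recent git versions refuse a bare "git pull"
    # on diverged branches unless told how to reconcile them
    git pull --no-rebase
    git log --graph --name-status
    git push origin master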
We would only perform the pull step after finding which files are in conflict with “git diff --name-only --diff-filter=U”, checking things with “git status”, and examining the file in question (fA.txt). The important thing is to make sure there are no uncommitted edits and that there are no unresolved merges in any of the files mentioned by diff (git merge markers are marked with strings like “>>>>>>”, “<<<<<<” and “======”). After the pull we have a split and merged history which can be inspected with the “git log” command or with the “gitk” GUI:
The command “git log --graph --name-status” gives a useful visualization of the repository after this sequence. Notice the history splitting into two parallel time series and then re-joining at the merge record:

    * commit 8062e84599fee5a9333450013ffd677eeeec20eb
    |\  Merge: c86857b 54eb3ad
    | | Author: John Mount
    | | Date: Wed Oct 30 09:54:01 2013 -0700
    | |
    | |     Merge branch 'master' of /Users/johnmount/Documents/writing/gitConflict/hub
    | |
    | * commit 54eb3adadd8201b19fa248aa868c305ec2685e54
    | | Author: John Mount
    | | Date: Wed Oct 30 09:53:17 2013 -0700
    | |
    | |     A work add
    | |
    | | A fA.txt
    | |
    * | commit c86857b0e5b0cd7a9ed020126d38489f0d6b19e9
    | | Author: John Mount
    | | Date: Wed Oct 30 09:53:17 2013 -0700
    | |
    | |     B work edit
    | |
    | | M fA.txt
    | |
    * | commit f64cec5af0d9784e544e8d2dd29ebf63771b47ec
    |/  Author: John Mount
    |   Date: Wed Oct 30 09:53:17 2013 -0700
    |
    |       B work add
    |
    |   A fA.txt
    |
    * commit 07e32d813c89ac853a2f2e490aaac1c50daa70b2
      Author: John Mount
      Date: Wed Oct 30 09:53:16 2013 -0700
One important thing we can see is that there are no changed files in the merge, so no conflicts were edited during the merge. If we had edited content conflicts (and added them with “git commit -a”) they would be listed as “Conflicts” in “git log --graph --name-status”.
The question is then: why do we even have these conflicts (and why are steps removed on rebase or merge-records written on merge)? The answer is: a git history is documenting that edits you made to a file (or a set of files) happen in a context or a state of the world. You may have read code, comments and notes from other files in the repository that you did not change. The requirement on git history is if two edits are made with different states of the world there should then be an explicit merge to document that they are in fact consistent edits. The hope is that somebody looked for issues when they merged, so you do not have to look for issues months later when maintaining the code.
A common example of two edits that look good in isolation, but don’t make sense when taken together, would be the following. Suppose you were told: “your documentation of your sqrt() function says it returns NaN on negative inputs, but the implementation just returns 0.” Developer A (our evil developer) revises the documentation (found in, say, a mark-down file) to say negative inputs are not allowed and have undefined behavior (weakening the code’s guarantees to users, a legitimate tactic but it tends to make the code using the function larger). In parallel developer B adds an if-statement to the code (found in, say, a C-file) to return NaN on negative inputs (making the code more robust, but making each call slightly more expensive). Each of these changes in isolation makes some sense, but taken together they are in semantic conflict. The new code compiles and works, but the changes really are moving in different directions.
For real conflicts you are going to have to merge. For pseudo-conflicts you have another option: use the “rebase” command to collapse commits into a single big commit before attempting the “pull”. This way there are no intermediate states, so if the final state is consistent everything is consistent.
The commands in this case are:
    git rebase --abort
    git fetch
    git status
    git rebase -i HEAD~2
    git pull --rebase origin master
    git log --graph --name-status
    git push origin master
The “git status” output tells us how many commits we have not pushed yet. We copy this number into the rebase command (in this case 2 becomes “HEAD~2”). The rebase command will bring up an editor; in this editor change all but the first line that starts with “pick” to start with “s” (for “squash”). That is, in this example the editor opens with two “pick” lines (one for “B work add” and one for “B work edit”) and we change the second of them to start with “s”.
After that you will be asked to edit the commit message (you can leave it as is) and the collapse is complete. At this point B’s view of the repository looks like this:
    * commit ad827692b9558b86d3ffbf259d49ba782474ad6a
    | Author: John Mount
    | Date: Wed Oct 30 09:57:41 2013 -0700
    |
    |     B work add
    |
    |     B work edit
    |
    | A fA.txt
    |
    * commit 70f9b02b338a53470fd16248afbe3e3add976257
      Author: John Mount
      Date: Wed Oct 30 09:57:40 2013 -0700

          first commit

      A README.txt
The two B actions have been collapsed into a single commit (hiding any intermediate state). After the “git pull --rebase origin master” step B’s view looks like this:
    * commit 321022b38a6fcb5a9293481dfd1fd4346dafa1ae
    | Author: John Mount
    | Date: Wed Oct 30 09:57:41 2013 -0700
    |
    |     A work add
    |
    | A fA.txt
    |
    * commit 70f9b02b338a53470fd16248afbe3e3add976257
      Author: John Mount
      Date: Wed Oct 30 09:57:40 2013 -0700

          first commit

      A README.txt
Notice the rebasing applied A’s identical change first and then threw out B’s commit (and messages) as being redundant after A’s work. The benefit of this more complicated procedure is a “tidied-up history” that is easier to read (“oh, A made an edit and it looks like B took their change”).
For more on collapsing commits see: Collapsing commits in Git.
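If the interactive editor is inconvenient (for example in a script), one possible non-interactive substitute for the squash step is a soft reset followed by a fresh commit. This is a swapped-in technique, not the procedure used above, and the commit message “B work (squashed)”, the temporary directory, and the demo identity are placeholders. The sketch rebuilds the pseudo conflict, counts B’s unpushed commits, collapses them, and then pulls with rebase.

```shell
#!/bin/bash
# Throwaway demo: collapse B's unpushed commits without the interactive editor.
set -e
# Placeholder identity so commits work on a machine with no git configuration.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

top=$(mktemp -d)
cd "$top"

# Rebuild the pseudo conflict (same layout as the setup script in this article).
git init -q repoA
cd repoA
git symbolic-ref HEAD refs/heads/master   # pin the branch name for this demo
echo "practice repo" > README.txt
git add -A . && git commit -q -m "first commit"
cd ..
git clone -q --bare repoA hub.git
git -C repoA remote add origin ../hub.git
git clone -q hub.git repoB
cd repoA
echo "content A1" > fA.txt
git add -A . && git commit -q -m "A work add"
git push -q origin master
cd ../repoB
echo "content B1" > fA.txt
git add -A . && git commit -q -m "B work add"
echo "content A1" > fA.txt
git add -A . && git commit -q -m "B work edit"

# Count the commits B has that the hub does not.
git fetch -q origin
n=$(git rev-list --count origin/master..HEAD)

# Soft reset: move the branch back n commits but keep the final content staged,
# then record that content as one commit (no intermediate states remain).
git reset -q --soft "HEAD~$n"
git commit -q -m "B work (squashed)"

# With only the final state left, the rebase pull no longer conflicts.
git pull -q --rebase origin master

top_subject=$(git log -1 --format=%s)
total=$(git rev-list --count HEAD)
echo "unpushed commits squashed: $n, newest commit now: $top_subject"
```

As with the interactive squash, the collapsed commit turns out to be identical to A’s change, so the rebase drops it and B ends up on A’s history.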
And now you know a couple of ways to resolve git conflicts.
Git is rather interesting in that it automatically detects and manages so much of what you would have to specify with other version control systems (example: git finds which files have changed instead of you specifying it; git also decides which files are related on its own). Because of the large degree of automation the beginning user usually has a severe under-estimate of how much git in fact tracks for them. This makes git fairly quick except when git insists you help decide how a possible global inconsistency should be recorded in history (either as a rebase or a branch followed by a merge record). The point is: git suspects possible inconsistency based on global state (even when the user may not think there is such) and then forces the committer to decide how to annotate at the time of commit (a great service to any possible readers in the future). Git automates so much of the record keeping that it is always a shock when you have a conflict and have to express opinions on nuances you did not know were being tracked. Git is also an “anything is possible, but nothing is obvious or convenient” system. This is very hard on the user at first, but in the limit is much better than an “everything is smooth, but very little is possible” version control system (which can leave you stranded).
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.