“Sorting Used in Anger” (A rambling glimpse into the mind of a theorist)
Author: John Mount
4-24-2008
The other day I had a bit of time to kill before an appointment. Luck was with me: there was a nearby bookstore and I was able to pass some of the time skimming through a book called “Beautiful Code.” Everything started out fun and nostalgic. The book title reminded me of “The Art of Computer Programming” (a book that probably did as much through the grace of its title as it did through its incredible contents to attract minds into theoretical computer science). One of the chapters of “Beautiful Code” was by Jon Bentley (a hero of sharp reasoning and clever coding) and as I flipped to the chapter my day was ruined. There it was: Quicksort, an algorithm that I have a long love and hate relationship with.
Bentley’s code (combining his skill with a really good idea Bentley credits to Nico Lomuto) is indeed a masterpiece:
void quicksort(int l, int u)
{   int i, m;
    if (l >= u) return;
    swap(l, randint(l, u));
    m = l;
    for (i = l+1; i <= u; i++)
        if (x[i] < x[l])
            swap(++m, i);
    swap(l, m);
    quicksort(l, m-1);
    quicksort(m+1, u);
}
How can I have such strong feelings for something so abstract and small as this algorithm? Part of the answer is that I had just the other day written some Quicksort code that, until I flipped open the book, I had been quite proud of. The rest of the answer is that Quicksort and I have a history.
Coding is all about compromises. Some of the greats in our field work to teach us simplicity and purity but are often accused of being idealists. Not so: their point was that there will be more than enough compromise and lessening of your vision as your project progresses. So to have anything left you must start with a lot (which we usually do not). C. A. R. Hoare himself (the inventor of Quicksort) worked hard to build frameworks that were able to prove small programs like Quicksort functioned correctly. This effort is often criticized as being impractical for large software systems. But you cannot build large systems without a library of well-understood, reliable components. Why tolerate any unnecessary flaws when you will certainly face plenty of necessary flaws?
How I have managed and negotiated compromises has been a major source of pride in my work. I have pushed hard to move “nimble activities” such as research and prototyping away from looser languages (like Perl) into safer and stricter (though somewhat more tedious to work with) languages like Java.
This isn’t a pure position; it represents compromises and has consequences. Java itself has a number of unnecessary flaws. One flaw is that Java is very inefficient at storing huge collections of small objects. That is: Java is bad at a task that underlies a lot of the basic record keeping needed for a lot of the research I do (machine learning and data mining). I had, until opening “Beautiful Code,” been feeling very good about a compromise I had recently made. To avoid giving up on Java (and moving to an even more tedious language like C++) I figured out how to store all of my many, many small bits of information without creating a great number of objects. By introducing complicated code I could stay in Java and keep many of the advantages of the language. But the compromise was that my data lost access to a number of essential Java-supplied services, such as sorting and lookup.
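For concreteness, here is a minimal sketch (in Java, with all names invented for this example) of the kind of layout I mean: record number i lives at index i of a few parallel primitive arrays, so no per-record object is ever created.

    // A "poor man's record" layout: one row per index across parallel
    // primitive arrays, instead of one small object per record.
    public class FeatureTable {
        final int[]    featureId;   // record i's key
        final int[]    count;       // record i's count
        final double[] weight;      // record i's weight

        FeatureTable(int n) {
            featureId = new int[n];
            count     = new int[n];
            weight    = new double[n];
        }

        // access is by row index, not by object reference
        double weightOf(int i) { return weight[i]; }

        // moving a record means swapping every field at the two indices
        void swapRows(int i, int j) {
            int t = featureId[i]; featureId[i] = featureId[j]; featureId[j] = t;
            int c = count[i];     count[i]     = count[j];     count[j]     = c;
            double w = weight[i]; weight[i]    = weight[j];    weight[j]    = w;
        }
    }

The catch is exactly the one just described: java.util.Arrays.sort can sort any one of those primitive arrays by itself, but nothing in the standard library will sort the rows as units without first wrapping each row in an object.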
The solution was to re-implement sorting. Re-implementing sorting in this day and age (especially in a language such as Java that supplies some great sorted and ordered data structures) seems stupid. But none of the Java built-in functions are willing to sort a large collection of data without creating a great many objects (which is prohibitively wasteful). If I had code that was able to sort my “poor man’s records” directly I would have a complete solution. The most expedient sort procedure to code up is Quicksort (which tends to run very well and does not need a lot of extra space for record keeping). So I coded up what I thought was a pretty clean and clever Quicksort for my problem. Had I more time I would have preferred to use Mergesort, which has better guarantees than Quicksort (Quicksort is usually very fast, Mergesort is always fast).
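The shape of what I wrote was roughly the following (a sketch only, not my actual code, reusing the invented FeatureTable layout from the example above): an index-range Quicksort that compares the keys in place in the primitive array and uses swapRows to keep all the fields of a record moving together.

    // Sketch of a Quicksort over the parallel-array layout.  Simplified; as
    // discussed below, my real version also had to guard against the
    // duplicate-key problem.
    static void sortByFeatureId(FeatureTable t, int l, int u) {
        if (l >= u) return;
        t.swapRows(l, l + (u - l) / 2);   // move a middle row up to act as the pivot
        int m = l;
        for (int i = l + 1; i <= u; i++)
            if (t.featureId[i] < t.featureId[l]) t.swapRows(++m, i);
        t.swapRows(l, m);                 // put the pivot row in its final position
        sortByFeatureId(t, l, m - 1);
        sortByFeatureId(t, m + 1, u);
    }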
Seeing Bentley’s code I realized my code was “bloated and flabby.” What I did in four separate ways he did in one. My Quicksort was quick and correct, but not perfect. I began to think that perhaps more of my reasoning was flabby and expedient. Should I have moved the whole project out of Java? Should I have implemented Mergesort? Why did my code seem so filled with fret and worry?
Then I saw it. Bentley’s example was wrong. The code was so unworried that it failed to anticipate an important possibility. When you are really thinking hard about an algorithm you can often simulate it in your head. Bentley’s clear writing made thinking easier. Suppose you gave Bentley’s Quicksort code a set of items to sort that were all identical. Then the “if”-statement would never be true and the variable “m” would never be increased. At the end of each run you would always have “m” pointing to the left-most portion of the array (despite the clever choice of pivot made earlier in the code). The Quicksort procedure depends on calling itself to re-sort shorter arrays, but in this case the re-sort would be on arrays only a single item shorter than the original. This means Quicksort would be doing about the same amount of work over and over again, instead of working on smaller and smaller pieces. Quicksort would be very very slow.
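To make the slow-down concrete, here is a small experiment of my own (a sketch, not Bentley’s code): the same single-split-point partition, instrumented to count element comparisons, run once on an array of identical keys and once on a shuffled array of distinct keys.

    import java.util.Random;

    public class QuicksortCount {
        static long comparisons = 0;
        static final Random rng = new Random(42);

        static void swap(int[] x, int i, int j) { int t = x[i]; x[i] = x[j]; x[j] = t; }

        // same scheme as Bentley's version: a single split index m
        static void quicksort(int[] x, int l, int u) {
            if (l >= u) return;
            swap(x, l, l + rng.nextInt(u - l + 1));   // random pivot, as in the original
            int m = l;
            for (int i = l + 1; i <= u; i++) {
                comparisons++;
                if (x[i] < x[l]) swap(x, ++m, i);
            }
            swap(x, l, m);
            quicksort(x, l, m - 1);
            quicksort(x, m + 1, u);
        }

        public static void main(String[] args) {
            int n = 3000;

            int[] allEqual = new int[n];               // every key is zero
            comparisons = 0;
            quicksort(allEqual, 0, n - 1);
            System.out.println("identical keys: " + comparisons + " comparisons");
            // close to n*(n-1)/2: about 4.5 million for n = 3000

            int[] distinct = new int[n];
            for (int i = 0; i < n; i++) distinct[i] = i;
            for (int i = n - 1; i > 0; i--) swap(distinct, i, rng.nextInt(i + 1));
            comparisons = 0;
            quicksort(distinct, 0, n - 1);
            System.out.println("distinct keys:  " + comparisons + " comparisons");
            // within a small constant factor of n*log2(n): tens of thousands for n = 3000
        }
    }

On the identical keys every level of recursion peels off only one element, so the comparison count lands near n*(n-1)/2; on the shuffled distinct keys it stays within a small constant factor of n*log2(n).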
Why did I notice this? Part of it is my training: theoretical computer scientists are trained to work examples and trained to worry. The other part is that I had been burned by this in the past. Years ago I had been with a group doing very interesting biotech research on what at the time were moderately large data sets. One of the clever steps was that the group had reduced a lot of the research steps to standard computer science techniques like sorting and joining (much like Google’s famous MapReduce reduces complicated computations to primitives). Then the standard “commercial package” failed us. The built-in sorting function was often very very slow. This often happened if there were a lot of repeated values in the data we were trying to sort.
Years ago we fixed the problem by implementing our own Mergesort. Mergesort is a bit harder to implement than Quicksort but there are no “special situations” that make Mergesort slow down. Quicksort itself can, at the loss of some elegance, be coded in such a way as to avoid any problems with duplicate entries (as I had recently done). However, there is always an unlucky data set that Quicksort will be slow on. The unlucky data set is rare, but it can happen (unlike with Mergesort).
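For comparison, here is a minimal Mergesort sketch (in Java, on a plain int[]; not the code we wrote back then). The split point depends only on the index range, never on the data, so there is no unlucky input; the price is the temporary array.

    // Mergesort: always splits the range exactly in half, so the recursion
    // depth is about log2(n) no matter what the data looks like.
    static void mergesort(int[] a, int[] tmp, int lo, int hi) {
        if (lo >= hi) return;
        int mid = (lo + hi) >>> 1;
        mergesort(a, tmp, lo, mid);
        mergesort(a, tmp, mid + 1, hi);
        // merge the two sorted halves into tmp, then copy back
        int i = lo, j = mid + 1, k = lo;
        while (i <= mid && j <= hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];  // ties keep order (stable)
        while (i <= mid) tmp[k++] = a[i++];
        while (j <= hi)  tmp[k++] = a[j++];
        for (k = lo; k <= hi; k++) a[k] = tmp[k];
    }

    // usage: mergesort(a, new int[a.length], 0, a.length - 1);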
Many succinct implementations of Quicksort have this flaw (nearly guaranteed slowness when there are many duplicate keys in the data to sort). I ran into this bug in production in the 1990s in Microsoft’s C++ libraries of the time.
One thing to look for is whether Quicksort, when it calls itself, uses a single mid-value variable to represent the partition (like the “m-1” and “m+1” above) or two variables that try to keep values equal to the pivot out of both recursive calls. For instance, the code found at http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Sort/Quick/ (by Lloyd Allison, who attributes it to Niklaus Wirth’s 1986 book “Algorithms and Data Structures”) performs some careful accounting to keep duplicate elements from all piling up on one side of the split. Beyond the code that does this and the comments documenting what is being attempted, the key thing to look for is the use of two separate variables (“right” and “left,” in addition to “lo” and “hi”) in the recursive calls:
quicksort(a, lo, right);   // divide and conquer
quicksort(a, left, hi);
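Translated into the style of the rest of this post, the two-index idea looks roughly like this (a Java sketch, not Allison’s or Wirth’s actual code). Both inner scans stop on a key equal to the pivot, so equal keys get traded across the middle and neither recursive call is handed nearly the whole array.

    // Two-index partition: "left" walks up and "right" walks down, both
    // stopping on keys equal to the pivot, so duplicates end up spread
    // across both halves.
    static void quicksort(int[] a, int lo, int hi) {
        int left = lo, right = hi;
        int pivot = a[lo + (hi - lo) / 2];            // middle element as pivot
        while (left <= right) {
            while (a[left]  < pivot) left++;          // stops at a key >= pivot
            while (a[right] > pivot) right--;         // stops at a key <= pivot
            if (left <= right) {
                int t = a[left]; a[left] = a[right]; a[right] = t;
                left++;
                right--;
            }
        }
        if (lo < right) quicksort(a, lo, right);      // divide and conquer
        if (left < hi)  quicksort(a, left, hi);
    }

On an array of identical keys the two scans stop immediately, swap, and walk toward each other, so each call splits its range roughly in half instead of peeling off one element at a time.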
One thing that has always fundamentally bugged me about Quicksort: why settle for a procedure that “almost always works quickly” when there are other procedures that “always work quickly”? Why would someone like C. A. R. Hoare promote a needless compromise?
And that is just it: it isn’t needless. We live in the real world with bounded resources, and compromise is inevitable. If you want to sort a single list and guarantee it will be done quickly you really should be using a provably good sorting procedure (like Mergesort). If, however, you are asked to solve many sorting problems then a procedure like Quicksort is so often faster than others that even when you add in the time lost on the rare bad situations (which you will eventually run into) you will still be, on average, ahead for using Quicksort. It is a compromise having to pick a “most appropriate method” instead of a “best method” (because “best” varies by situation), but it is not an unnecessary compromise. Seeing and managing these trade-offs is the essence of design and programming.
Categories: Computer Science, Opinion
I found out that Tim Peters (of PythonLabs) wrote a very nice article covering these kinds of issues with Quicksort in his introduction to chapter 5 of the Python Cookbook (2nd Edition, O’Reilly 2005). Hoare’s original implementation (written as a junior programmer) won him a bet: he had been challenged that he could not come up with anything faster than the highly tuned system sorting function (a great example of “coding in the small”). Of course in the wild a lot of other issues show up (minimizing object creation, applying permutations to external lists, non-deterministic floating point implementations, heap structures and stable sorting).
Finally got around to going through Bentley’s masterpiece Programming Pearls, Second Edition. The tone is definitely different than that in Beautiful Code: Bentley calls out bad performance and bugs much earlier (and expresses an aversion to reading others’ bug-filled “corrections”). Column 11 of Programming Pearls presents four quicksorts, all of which use a single split point (but some smarts to avoid the quadratic run time on constant data).
This sort of problem with sorting seems like it has always been with us. For example, Bentley quoting Knuth: “in section 6.2.1 of his ‘Sorting and Searching,’ Knuth points out that while the first binary search was published in 1946, the first published binary search without bugs did not appear until 1962” (Programming Pearls, 2nd edition, “Writing Correct Programs,” section 4.1, p. 34).
See also: Nearly All Binary Searches and Mergesorts are Broken (2006) http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html