
Map Reduce: A Good Idea

Some time ago I subscribed to The Database Column because it seemed it would be fun to see what these accomplished people wanted to discuss. We owe much of our current database technology to Professor Stonebraker, and Vertica sounds like an incredible product. And I definitely want to continue to subscribe.

However, the reading experience is marred by some flaw in their RSS system that keeps marking the article “MapReduce: A major step backwards” as a new article. This causes the article to appear in my RSS reader every few weeks as “new.” This wouldn’t bother me too much except that the article runs so counter to experience that it is itself offensive.

I have no desire to defend Google (the home of MapReduce); they don't need it and are clearly laughing all the way to the bank. However, the points used to kick at MapReduce are so broad and so devalue practitioner experience that they are insulting. I find the individual arguments offensive and wish to stand against them. I am not that concerned about the conclusion: use MapReduce or don't. For some things MapReduce is a good tool and for some things it is not.

Let’s limit ourselves to the 5 primary complaints from the article. The article (verbatim) says MapReduce is:

1. A giant step backward in the programming paradigm for large-scale data intensive applications.

2. A sub-optimal implementation, in that it uses brute force instead of indexing.

3. Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago.

4. Missing most of the features that are routinely included in current DBMS.

5. Incompatible with all of the tools DBMS users have come to depend on.

Now let us comment:

1. “A giant step backward in the programming paradigm for large-scale data intensive applications.”

Actually, no.

MapReduce represents a continuity in the stream of ideas that made UNIX great: composable transient tools. Not everything is a database or data warehouse. The grungy UNIX tools (like sort, sed, awk, and join) have often been combined to do large-scale (for the time) research because they all work "out of core" fairly well. This made for horrible baling-wire set-ups, but it handled problems much larger than would otherwise have been possible on the hardware of the day.
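To make the "composable transient tools" point concrete, here is a small sketch (my own illustration in Python, not anything from the original post) of the same streaming style: each stage hands records to the next one at a time, so the input never has to fit in memory.

# Illustrative sketch only: a UNIX-pipeline-style chain of streaming stages,
# roughly analogous to grep | cut | sort | uniq -c, written as Python
# generators. Each stage consumes one record at a time, so only the final
# aggregation holds state in memory; the input file is never loaded at once.
# The file path and the "ERROR" filter are made-up examples.
import sys
from collections import Counter

def read_lines(path):
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def keep_matching(lines, needle):
    return (line for line in lines if needle in line)

def extract_field(lines, index, sep="\t"):
    return (line.split(sep)[index] for line in lines)

if __name__ == "__main__":
    path = sys.argv[1]  # e.g. a large tab-separated log file
    field = extract_field(keep_matching(read_lines(path), "ERROR"), 0)
    counts = Counter(field)  # summarize, like uniq -c after a sort
    for value, n in counts.most_common(10):
        print(n, value)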

In addition, the author trots out the "it's Codasyl all over again" argument. This argument refers to the ongoing pain and expense of binding algorithmic details too closely to the data representation. In earlier writing it was a fantastic point, warning that the up-and-coming object-oriented databases were going to be the same nasty pointer-chasing nightmares that hierarchical databases had been. I can see why an author might feel that just saying "it's Codasyl" could win any argument. But the analogy does not transfer: MapReduce jobs read and write plain flat files, and the map and reduce steps are transient code, so there is no persistent pointer-laden representation to get locked into.

2. “A sub-optimal implementation, in that it uses brute force instead of indexing.”

MapReduce does not use brute force.

MapReduce uses the idea (one that goes back to merge sort) that parallel traversals (that is, running through two lists in the same key order synchronously) are a very powerful technique that can, among other things, produce indices. MapReduce is so efficient that it has been shown to be competitive with the best large-scale sorting algorithms on their home turf: sorting.
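To make "parallel traversal" concrete, here is a minimal sketch (my own illustration, not code from the post or from the MapReduce paper) of a merge-join style scan: two inputs sorted on the same key are walked together in one synchronized pass, which is exactly the kind of work the sort/merge machinery behind MapReduce does well.

# Minimal sketch: merge-join two lists already sorted by key, in one pass.
# Assumes each key appears at most once per input (a one-to-one join) to keep
# the illustration short; the names merge_join, left, right are made up.
def merge_join(left, right):
    li, ri = 0, 0
    while li < len(left) and ri < len(right):
        lkey, lval = left[li]
        rkey, rval = right[ri]
        if lkey == rkey:
            yield (lkey, lval, rval)
            li += 1
            ri += 1
        elif lkey < rkey:
            li += 1
        else:
            ri += 1

a = [("apple", 3), ("cherry", 1), ("pear", 7)]
b = [("apple", 2), ("pear", 5), ("quince", 4)]
print(list(merge_join(a, b)))  # [('apple', 3, 2), ('pear', 7, 5)]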

MapReduce looks brutish because it drops a lot of popular design features. One such feature is trying to speed things up through local caching and combining. However, on the data MapReduce is commonly used on (free-form written text) it is a provable property of the data that local caching is an ineffective complication (due to the heavy-tailed word distribution). So many of the graceful features missing from MapReduce are actually no help on the kinds of data it is used on. There is a certain grace in keeping only the features that are actually helping.

3. “Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago.”

A nasty attack.

MapReduce is a good explanation of some merging techniques that have been common knowledge for quite a while. This is a legitimate expository goal: explaining something everybody already "knows," but explaining it better. In fact this is very hard to do and is considered a legitimate accomplishment in many fields (for example, Rota stated it was a legitimate goal in mathematics). I myself looked at some of my own older code for dealing with very large data sets after reading the MapReduce paper. I saw that the paper was describing what I was already doing (splitting the data into streams for later re-joining) and explaining it so well that it was now a method and no longer a hack. When a paper successfully teaches you something about what you already "know," it is a good work.

The attack is also inaccurate: the ideas are not 25 years old, they are closer to 120 years old.
We could easily trace the lineage of MapReduce back to Hollerith-style sorting machines that pre-date general-purpose computers (i.e. going back to before 1889). MapReduce refers back to a time when all computation was performed by what we now call external sorting and tabulation. These 19th century technologies may seem archaic, but they were developed in a world similar to ours: a world where the amount of data is in excess of your conveniently reconfigurable computational resources.

4. “Missing most of the features that are routinely included in current DBMS.”

Unfortunate.

I miss a lot of those features.

However, because MapReduce is such a lean technique I have seen engineers implement their own MapReduce systems in a day or two (to solve a problem they are working on). That is, they are successfully sorting, joining, indexing, and summarizing hundreds of gigabytes of data on a consumer PC within a couple of days of being asked to, from scratch, after reading the MapReduce paper.
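For a sense of how little machinery is involved, here is a minimal sketch (my own illustration, not anyone's production system and not the paper's implementation) of the map/sort/reduce skeleton such a home-grown system follows. A real version for hundreds of gigabytes would replace the in-memory sort with an external, on-disk sort and spread the work over processes, but the shape is the same.

# Minimal in-memory sketch of the MapReduce control flow:
# map each record to (key, value) pairs, group equal keys by sorting,
# then reduce each group. Function and variable names are illustrative.
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    pairs = []
    for record in records:
        pairs.extend(mapper(record))      # map: record -> [(key, value), ...]
    pairs.sort(key=itemgetter(0))         # "shuffle": bring equal keys together
    return {
        key: reducer(key, [v for _, v in group])  # reduce one key group
        for key, group in groupby(pairs, key=itemgetter(0))
    }

# Example: word counting over lines of text.
lines = ["to be or not to be", "to see or not to see"]
word_counts = map_reduce(
    lines,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda key, values: sum(values),
)
print(word_counts)  # {'be': 2, 'not': 2, 'or': 2, 'see': 2, 'to': 4}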

The “make versus buy” decision should not always come out “make.” But it is not wise to artificially bloat up requirements so that the decision can only be “buy.”

5. “Incompatible with all of the tools DBMS users have come to depend on.”

Good.

Frankly, for a lot of analytic practitioners many DBMS systems and tools have become expensive obstacles in the way of getting results. Yes, we enjoy humiliating an interview candidate who does not know all of the Codd normal forms (or can't remember which of the alphabet soup of OLTP and OLAP is the "good one") as much as the next person. But to many of us a lot of these tools and procedures are more obstacles than solutions.

This may sound nasty, but if it were not the case, why would companies like Vertica be producing radical new database tools? The fact is that existing DBMS tools were designed for a different type and scale of data than we regularly see on the web (and column-oriented database designers seem to share this view). The situation is so bad that "roach motel" is common analyst slang for "data warehouse" (derived from: "data checks in, but it never checks out").

This isn’t meant to be a hagiography of MapReduce, but given that MapReduce has paid the bills I feel it deserves a small show of respect along the lines of “dance with the one who brung you.”

MapReduce is not a panacea. One of the tasks I have hated most in my career was maintaining a seven-step MapReduce-based system. I would love to have avoided all the detail fiddling that set-up required. However, the system paid our bills by performing a calculation that was beyond the scale of simpler methods, and it would have been unaffordable to buy a solution.

Categories: Computer Science Opinion


jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

