Sunday, July 13, 2008

Map Reduce and Parallelism

Map-Reduce is a powerful mechanism for parallel execution. Originally a pair of semantics used in functional languages, map-reduce is now heavily employed for processing information in a variety of clustered environments. Google use it extensively and have built their own framework for handling map-reduce [1].

As the name suggests, map-reduce is made up of two key concepts:

1. Map takes a list of data and applies an operation or transformation to each element of the list to produce another list of results. The general implication is that the results list has the same length as the source list: if you had a list of 1,000 numbers, after the map you'd have another list of 1,000 elements (be they numbers or not).
See: http://en.wikipedia.org/wiki/Map_(higher-order_function)

2. Reduce takes the list of results and compiles them in some fashion. Unlike Map, there is an expectation of "reduction" or derivation of the data - that is, if you had a list of 1,000, the result might be a list of 100, or a single number [2]. So a reduce could be anything from summing all the elements, to sorting them, to cutting them back to the top 100 - or any variant thereof. (Both operations are sketched in code after these definitions.)
See: http://en.wikipedia.org/wiki/Fold_(higher-order_function)
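
To make the two concepts concrete, here is a minimal sketch in Python (the numbers and operations are purely illustrative):

from functools import reduce

numbers = [1, 2, 3, 4, 5]

# Map: apply a transformation to each element, producing a list of equal length.
squares = list(map(lambda n: n * n, numbers))  # [1, 4, 9, 16, 25]

# Reduce (a "fold"): combine the elements down to a smaller result - here, one number.
total = reduce(lambda a, b: a + b, squares)    # 55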

Google list a number of advantages of their map-reduce framework [1]:
- Automatic parallelisation and distribution
- Fault-tolerance
- I/O scheduling
- Status and monitoring

Most of these are operational benefits. However, the real benefit to Google in using map-reduce lies in the first item: the automatic parallelisation and distribution of processing. This is particularly important when processing very large data sets and/or responding to a user request - a user will only wait so long for a response. A user isn't going to wait ten minutes for a web search to return. So Google run the same search spread across thousands of machines, giving you a response in seconds (in reality, sub-second).

A fairly accessible example of a map-reduce operation is this kind of search. In this case, Map would take a function, such as "Score", and apply it to each element of a large list of data - here, a large list of webpages. The score is representative of how well the webpage matches some criteria, such as search terms.

Reduce takes one or more lists of scores and performs a "fold" or reduce. In this example, it would do something like take the list of scores and cut it back to the top 100, sorted from highest to lowest. This example of a reduce always produces 100 or fewer scores - give it 50 and it will produce 50 results (sorted), give it 1,000 and it will produce a list of the top 100 results (again, sorted).
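
That kind of top-100 reduce is nearly a one-liner in Python (a sketch; heapq.nlargest returns the k largest items, sorted from highest to lowest):

import heapq

def top_k_reduce(scores, k=100):
    # Keep the k highest scores, sorted highest to lowest.
    # Given fewer than k scores, it simply returns them all, sorted.
    return heapq.nlargest(k, scores)

print(top_k_reduce([3, 1, 4, 1, 5], k=3))  # [5, 4, 3]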

For example, if I'm searching for "Jodoro", the Map looks at the 1,000 pages available and, using the scoring operation, gives each page a score for the occurrence of "Jodoro" on the page. Reduce then looks at these 1,000 scores and whittles them back to the top 100. The pseudocode for this might look like:


Define MapReduce(Pages, Score, Reduce)
    Scores = Map(Pages, Score)
    Result = Reduce(Scores)
    Return Result
End

Define Map(Pages, Score)
    Create a new list Scores
    For each Page in Pages
        PageScore = Score(Page)
        Add PageScore to Scores
    Next
    Return Scores
End
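
A direct, sequential translation into runnable Python might look like this (a sketch; the occurrence-counting scorer is far cruder than anything a real search engine would use):

def map_pages(pages, score):
    # Apply the scoring function to every page, producing one score per page.
    return [score(page) for page in pages]

def map_reduce(pages, score, reduce_fn):
    scores = map_pages(pages, score)
    return reduce_fn(scores)

def make_scorer(term):
    # A crude illustrative scorer: count occurrences of the search term.
    return lambda page: page.lower().count(term.lower())

def top_100(scores):
    # Sort highest to lowest and keep the top 100.
    return sorted(scores, reverse=True)[:100]

pages = ["Jodoro is a site", "nothing to see here", "jodoro, jodoro"]
print(map_reduce(pages, make_scorer("jodoro"), top_100))  # [2, 1, 0]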


This is all relatively pedestrian, but the power of a map-reduce is that these steps can be performed in parallel. To illustrate this, here is a high-level piece of pseudocode:


Define MapReduce(Pages, Score, Reduce)
    Split Pages into 4 lists, producing FirstPages, SecondPages, ThirdPages, FourthPages

    FirstScores = Map(FirstPages, Score)
    SecondScores = Map(SecondPages, Score)
    ThirdScores = Map(ThirdPages, Score)
    FourthScores = Map(FourthPages, Score)

    FinalScores = Reduce(FirstScores, SecondScores, ThirdScores, FourthScores)

    Return FinalScores
End


This map-reduce operation splits the scoring process into four pieces that are individually mapped, then ranked using the reduce function. So if you had 400 pages, you would end up with four lists of scores, each 100 scores long (FirstScores through FourthScores). The Reduce function takes these four lists of 100 scores and produces a list of the top 100.
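
In runnable form, that split might look like the following, using Python's multiprocessing module so the four chunks are scored by separate processes (a sketch; a real framework would distribute the work across machines rather than local processes):

from multiprocessing import Pool

def score_page(page):
    # Crude illustrative scorer: count occurrences of a fixed term.
    return page.lower().count("jodoro")

def score_chunk(pages):
    # The "map" step for one chunk: one score per page.
    return [score_page(page) for page in pages]

def top_100(score_lists):
    # The "reduce" step: merge the partial lists and keep the top 100.
    merged = [s for scores in score_lists for s in scores]
    return sorted(merged, reverse=True)[:100]

def parallel_map_reduce(pages, chunks=4):
    # Split the pages into roughly equal chunks.
    size = (len(pages) + chunks - 1) // chunks
    parts = [pages[i:i + size] for i in range(0, len(pages), size)]
    with Pool(chunks) as pool:
        score_lists = pool.map(score_chunk, parts)  # chunks scored in parallel
    return top_100(score_lists)

if __name__ == "__main__":
    pages = ["jodoro jodoro", "nothing here", "jodoro", "more jodoro text"] * 100
    print(parallel_map_reduce(pages)[:5])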

It's probably worth pointing out at this stage that this is a complete search: we've not used heuristics, we've examined every page and determined the "best" result. We could have cheated. For example, if a page scored very low, we might not have bothered to include it in the scored list [3]. Whilst such heuristics are useful, the results are no longer guaranteed to be complete. However, in specific cases such as search, it's probably appropriate to discard very low results.

In the case of this example, another variant might be:


...
FirstScores = Map(FirstPages, Score)
SecondScores = Map(SecondPages, Score)
Scores1 = Reduce(FirstScores, SecondScores)

ThirdScores = Map(ThirdPages, Score)
FourthScores = Map(FourthPages, Score)
Scores2 = Reduce(ThirdScores, FourthScores)

FinalScores = Reduce(Scores1, Scores2)
...


In this case, we Reduce three times. First we take the first two lists of results and combine them, then the second two - and finally we take the two aggregates to produce a final result.
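
This staged reduction only works because our top-100 reduce gives the same answer whether it's applied once over everything or pairwise over partial results. A quick Python check of that property (a sketch, using small numbers for brevity):

import heapq
import random

def top_k(*score_lists, k=5):
    merged = [s for scores in score_lists for s in scores]
    return heapq.nlargest(k, merged)

random.seed(42)
chunks = [[random.randint(0, 100) for _ in range(20)] for _ in range(4)]

# Reduce once over everything...
once = top_k(*chunks)

# ...or reduce three times, pairwise, as in the pseudocode above.
staged = top_k(top_k(chunks[0], chunks[1]), top_k(chunks[2], chunks[3]))

assert once == staged
print(once)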

So what does this all mean? Well, the key is that the mapping piece can occur entirely in parallel - or to be more correct, the scoring can occur in parallel. All of the four "Score" pieces above could occur independently. So you could run the four Score pieces on four different systems in order to process them faster. Google MapReduce or the open source Hadoop framework [4] take this parallelism a step further. Instead of breaking the Pages into four (or however many) arbitrary pieces, they create a large pool of servers - each with their own set of Pages [5]. When you invoke a map-reduce, a master server asks all of the servers to undertake the individual map on their particular data, producing the score or otherwise derived result. These scores are then "reduced" to the actual list of results [6]. So the split in the example pseudocode is actually achieved by partitioning the data across multiple systems.

All this seems pretty straightforward and useful. However, there is a sting in the tail. What isn't made clear by this explanation is a key underlying premise - that the individual score operations are independent. If your score operation is somehow dependent on or interlocked with another score operation, then clearly they cannot operate in parallel. This is almost certainly not the case when matching a set of search terms against a webpage: a score can be computed for a webpage on one system completely independently of the scoring that is occurring on another, physically distinct system.

It's probably also worth noting that the same could be said for Reduce - in our examples we showed Reduce being used either once or three times. The presumption is that the Reduce doesn't modify the underlying scores and has no other side-effects. If it did, multiple uses of Reduce might also produce an unexpected result.
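
Averaging makes a handy illustration of the kind of unexpected result this warns about (a contrived sketch; an averaging reduce loses the information needed to combine partial results, so reducing in stages gives a different answer from reducing once):

def mean_reduce(*score_lists):
    merged = [s for scores in score_lists for s in scores]
    return sum(merged) / len(merged)

a, b = [10, 20, 30, 40], [100]

print(mean_reduce(a, b))                                # 40.0 - reduced once
print(mean_reduce([mean_reduce(a)], [mean_reduce(b)]))  # 62.5 - reduced in stages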

Joel Spolsky rightly points out that an understanding of functional languages assists with understanding map-reduce [7], as map-reduce fundamentally came from semantics in languages such as Lisp. This is largely true, but the semantics of map-reduce do not necessarily imply parallel execution. Certainly the first functional implementations of map and reduce weren't intended for parallel operation at all.

So adding the right semantics into a language to give you map-reduce doesn't necessarily make the code parallel, though it does open up possibilities. In our search example, the structure of the data (Pages) and the independent operation of the mapping function (Score) are absolutely fundamental to having the map-reduce operate in parallel. The use of map-reduce primarily provides a metaphor where these work together in a logical, consistent way.

If you had a piece of Java code do your scoring, then it's entirely feasible that you don't know anything about the scoring implementation (as with a functional language, that's part of the point). However, you'd need to be careful that this piece of code isn't dependent on some kind of shared resource - it could be manipulating private variables, writing to disk, generating random numbers, or (god forbid) relying on some kind of Singleton instance. For example, your Score may initially generate a random number that is used in all the subsequent scoring operations. If you spread this across four machines you might have the unexpected side-effect that each machine is operating with a different random number. Whatever the technicalities, in some circumstances it may be very difficult to establish how independent an arbitrary Score implementation is. This is somewhat ironic, as this kind of abstraction (i.e. an arbitrary score function) is key to a functional language.
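
The random-number pitfall is easy to demonstrate in Python (a contrived sketch, simulating four machines with four scorer instances; each draws its own hidden random weight, so identical pages score differently depending on which "machine" handled them):

import random

class Scorer:
    def __init__(self):
        # Hidden state: each instance draws its own random weight up front,
        # just as each machine would when the scorer is deployed to it.
        self.weight = random.random()

    def score(self, page):
        return self.weight * page.count("jodoro")

# Simulate deploying the "same" scorer to four machines.
machines = [Scorer() for _ in range(4)]
print([m.score("jodoro jodoro") for m in machines])  # four different scores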

As another example, something like the patented Google PageRank algorithm [8] relies on the linkages between pages in order to score their importance. What is important here is that the score of a page is derived from the importance of the pages that link to it. In this case, you'd need to be careful about how you built your Score function (to rank the pages), as the score of an individual page is dependent on the scores of others [9].
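
To see why that dependency frustrates a single map step, here is a much-simplified, PageRank-flavoured iteration in Python (an illustrative sketch only, not Google's algorithm; each page's rank is computed from the current ranks of the pages linking to it, so no single pass of a per-page map can produce the final answer):

# A toy link graph: links[page] is the list of pages that `page` links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

ranks = {page: 1.0 for page in links}
damping = 0.85

for _ in range(20):  # iterate until the ranks settle
    new_ranks = {}
    for page in links:
        # A page's rank depends on the ranks of the pages linking to it.
        incoming = sum(ranks[src] / len(links[src])
                       for src in links if page in links[src])
        new_ranks[page] = (1 - damping) + damping * incoming
    ranks = new_ranks

print(ranks)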

So - it isn't the map-reduce that is operating in parallel. It's the way the source data and the transformation are defined that is the first enabler of parallel operation. You can't necessarily take map-reduce and apply it wholesale to an existing complex financial risk analysis, for example. There are likely complex algorithmic dependencies that simply cannot be accommodated - at least not without recasting the problem or data in a manner that suits the model. A solution to the problem might require a change in data definition, or might require the design of several map-reduce operations working together. Or, simply, the problem might have so many implicit algorithmic dependencies that map-reduce isn't appropriate at all.

Fundamentally, whilst map-reduce is useful for increasing the level of parallelism in your code, it doesn't intrinsically give you parallel processing. What it does give you is a common metaphor for this type of processing, enabling the development of a framework - and all the benefits a framework brings. If you need to do parallel processing, you still need to focus on the composition of the problem.

jon@jodoro.com


[1] Jeffrey Dean and Sanjay Ghemawat, 'MapReduce: Simplified Data Processing on Large Clusters', http://labs.google.com/papers/mapreduce.html, December 2004
[2] For something like a sort it could return the same number of elements. I can't think of a valid example where a Reduce would *increase* the number of elements. So reduce is probably a pretty apt term.
[3] Google would almost certainly do this with a Web search.
[4] See http://hadoop.apache.org/
[5] The reality is that they will have a lot of redundant nodes in order to scale, but this is the basic premise.
[6] In the case of a Google Web Search, the reduce is to reduce the set of results to the top 1,000 results. You never get more than 1,000: For example, http://www.google.com/search?q=the&start=990
[7] Joel Spolsky 'Can Your Programming Language Do This?', http://www.joelonsoftware.com/items/2006/08/01.html, August 2006
[8] See http://www.google.com/corporate/tech.html
[9] Clearly Google has approaches to solve this dependency, but it is useful in illustrating the point.

