Sunday, July 20, 2008

Defining Cloud Computing

Generally, when a new fad [1] comes along in the software industry there is a fantastic initial rush of enthusiasm, buoyed by the early success of the innovators...

...This is generally followed by a series of attempts to define just what the fad is. We've seen this in recent history with topics such as Service Oriented Architecture (SOA) or Web 2.0 - or even things that seem very specific at the outset, like AJAX. There are countless other examples going back further.

This is now the case with the term Cloud Computing. Cloud Computing is difficult to define, but it's very generally about the commoditisation of computing resources. Instead of buying a server with a specific processing and storage capacity, you simply buy an allocation of capacity from a vendor - an allocation that you can then increase or decrease as necessary. Instead of having a number of discrete resources (e.g. ten disks on ten different servers), you get access to storage as one virtual service.
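
As a rough sketch of that proposition (the vendor client below is entirely hypothetical - real services expose this through their own APIs), requesting capacity becomes a call against a service rather than a hardware purchase:

# Hypothetical client for a cloud vendor's capacity API - the names are illustrative only.
class CloudVendorClient:
    def __init__(self):
        self.allocated_gb = 0

    def allocate_storage(self, gigabytes):
        # In a real service this would be a remote call; here we just track the figure.
        self.allocated_gb += gigabytes
        return self.allocated_gb

    def release_storage(self, gigabytes):
        self.allocated_gb = max(0, self.allocated_gb - gigabytes)
        return self.allocated_gb

vendor = CloudVendorClient()
vendor.allocate_storage(500)   # scale up when demand grows
vendor.release_storage(200)    # scale back down - you only pay for what is allocated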

It seems a simple proposition from many perspectives, but in reality the distinction is pretty subtle. After all, buying a physical server in principle achieves the same result - you pay and get more capacity. However, the real difference is in the definition of the service you get, and the mechanics for accessing that service - which is where the definition gets complicated.

Recently on Slashdot (Multiple Experts Try Defining "Cloud Computing"), there was a reference to a SYS-CON article on the definition of Cloud Computing - Twenty Experts Define Cloud Computing [2].

So Cloud Computing is not unique in that it lacks an explicit boundary. Definitions in the SYS-CON article [2] varied quite a bit. Some of the commentary included:
  • ..the broad concept of using the internet to allow people to access technology-enabled services
  • Most computer savvy folks actually have a pretty good idea of what the term "cloud computing" means: outsourced, pay-as-you-go, on-demand, somewhere in the Internet, etc.
  • Clouds are the new Web 2.0. Nice marketing shine on top of existing technology. Remember back when every company threw some ajax on their site and said "Ta da! We’re a web 2.0 company now!"? Same story, new buzz word..

One of the definitions I particularly liked was by Yan Pritzker [3]. This definition makes three key points (refer to the article for the full explanation):
  • Clouds are vast resource pools with on-demand resource allocation.
  • Clouds are virtualized.
  • Clouds tend to be priced like utilities.

All of these definitions are generally from respected pundits and enthusiasts in the area. In the formative (storming) stage of a technology these are often very insightful and useful. However, as the audience and participants grow, this is diluted very quickly. In my experience, the definition of a fad is (eventually) corrupted by this dilution - in particular, as software and hardware vendors inevitably take control of the spotlight.

It's very common in my experience to see vendors come in with the following line - (a) here is this great fad and how it's changing the world, (b) it's obviously complex, so here is a definition of it that you can digest - and (c) here are our products... Coincidentally, our products fit that definition exactly.

This is perhaps cynical. It's perfectly natural for a vendor to define their strategy in line with industry thinking and then produce their product suite to fit. Arguably, IBM know quite a bit about Cloud Computing. IBM have been pushing the "On Demand" brand across their various offerings since 2002 [4]. This is not to say that IBM's On Demand necessarily fits some of the definitions of Cloud Computing [5] - but it's certain that IBM will fold the buzz around Cloud Computing into its strategy around On Demand [6].

However, the cynicism around the action of vendors in this space is not unfounded. There are a lot of products that suddenly become SOA or BPM or AJAX once the term becomes a hot property in the industry. Or you simply take an existing product suite, add another product or extension that fits the mould, and rebrand the lot. This also highlights one good strategy for a startup - Find an existing, established vendor, spot some kind of technology gap or inconsistency, develop against the gap and get acquired.

There is an underlying reason that this definition exercise is possible. To an outside party it may be clear that a new approach or technology has potential. This is driven by the outstanding success of the innovators and early adopters. This success often generates a justified amount of buzz in the industry. So there is potential, and a new way of doing things - and generally organisations have problems to solve. So the question inevitably arises - "This sounds good, but what can it do for me?".

Whilst this generally results in a bewildering number of answers, usually designed to push product, the fact is that often the cornerstones underlying these fads are relatively straightforward. SOA really has its roots in Web Services - and although the two are theoretically independent, they have walked hand in hand through the hype cycle. Almost every presentation you see will belie this fact; in fact, they will often state the exact opposite: "SOA is not Web Services" [7]. This is somewhat true, but it often leads to a definition that is a tautology.

These snippets usually get entangled for good reason. The fact is, just adopting Web Services (or whatever equivalent) simply isn't the end of the story. In fact, it's often just the first step in thousands. You have Services now, so you need to lifecycle manage them, instrument them, align them to Business Processes to get "true business benefit and reusability", etc, etc. Most of these aren't new problems at all, but the componentisation around the original Service approach (the Web Service) has changed the unit of work, and changed how a lot of moving parts are now expected to interconnect - software, hardware, processes and people. So something like SOA moves in to fill these gaps, which clearly can be nebulous in the extreme.

So what you have is a shift in technology that quickly becomes a bigger beast. Now that you have this technology, you need the attendant shifts in the methods, tools and outlooks that support it. Sometimes these shifts are on the periphery, sometimes they are orders of magnitude greater than the original technology driver. This is where the definitions become nebulous. e.g. "Now that I'm building Services, which ones should I build and how do I maintain them?"

In that sense, Cloud Computing falls into a similar trap. Accepting the fundamental technology premise around Cloud Computing opens up a lot of questions. Cloud Computing may be quite simple technically, but there are a whole raft of other implications that you need to deal with - many of which are old ground. I'm sure Mainframe pundits look at concepts such as SOA and Cloud Computing and say "but we've already solved that". In fact many cynics I speak to in the industry label these initiatives as the industry's desire to get back to the Mainframe - arguable, but you can usually see their point. CORBA pundits are probably part of a similar chorus.

Irrespective, something like Cloud Computing puts these combinations together in unique ways. So whilst you're treading old ground, some of the solutions might not necessarily be the same. For example, one of the keys in Cloud Computing could be achieving a cost basis that drives economies of scale.

So, to contradict myself and provide a definition - here are some of the technology keys that I see as driving Cloud Computing:
  1. Even though it might be transparent to an end user, Cloud Computing is run on "Commodity Hardware". This is a loose definition in itself, but generally means off-the-shelf x86 servers, often running Linux.
  2. Further to this, instead of a handful of Big Iron servers, Cloud Computing tends to rely on a large cluster of these commodity machines.
  3. Cloud Computing relies on increases in bandwidth to allow access to processing and storage resources via the Internet.

What this generally means is that:
  • Your computing services are cheaper because they are using cheaper hardware.
  • Your computing services are cheaper because utilisation is higher - if you're not using your capacity, someone else probably is.
  • It's easier to scale, as you are tapping the bigger resource pool that a single vendor can offer [8].
  • The Quality of Service (e.g. Reliability) is tuned by the inherent capacity and redundancy of the cloud. The systems aren't specified to be ultra-reliable; the resilience of a cloud is driven by having redundant nodes to take over in the event of failure.
  • You were reliant on external parties before (e.g. ISP, Web Host), but you're increasingly reliant on them "higher up the stack".
  • Cloud Computing relies less on references to physical assets (e.g. a fileshare on a specific server), instead virtualising these for the customer. You just need a reference to the file, and a place to ask for it - you don't care where it actually resides (sketched below).
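
As a rough illustration of that last point (the store and its interface are hypothetical - real clouds resolve this behind their own APIs), the customer deals only in a key and a place to ask for it:

# Sketch of a virtualised store: callers hold only a key, never a server or disk path.
class VirtualStore:
    def __init__(self):
        self._locations = {}   # key -> (node, physical path), hidden from the caller

    def put(self, key, data, node, path):
        # The data would be written to the chosen node here; we just record where it went.
        self._locations[key] = (node, path)

    def get(self, key):
        node, path = self._locations[key]
        # The caller never sees node or path - it just asked for the key.
        return "contents of %s" % key

store = VirtualStore()
store.put("reports/q2.pdf", b"...", node="server-17", path="/disk3/abc")
print(store.get("reports/q2.pdf"))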

Whilst most of these technology shifts are all about "Business Benefit" of some sort, it's often useful to take a concept back to its absolute fundamentals to put it in perspective. Much of the surrounding definition points to a philosophy or method, but underlying it all is often a series of technical shifts that is prompting the change.

Going back to the technology definition can obscure some of the more conceptual benefits, but it also gives perspective - many of the pitfalls with a new technology have already been encountered before. Paul Wallis' definition [9], linked from the SYS-CON article [2], highlights this fact further.

...And perhaps most of all, the key technologies are simply the easiest part to nail down.


jon@jodoro.com
...


[1] Fad is perhaps a bit of an unfair term - A fad could actually be a very significant event.
[2] Twenty Experts Define Cloud Computing, http://cloudcomputing.sys-con.com/read/612375_p.htm
[3] Yan Pritzker, Defining Cloud Computing, http://virtualization.sys-con.com/read/595685.htm
[4] J. Hoskins, B Woolf, IBM on Demand Technology Made Simple, 3rd Ed, Clear Horizon
[5] IBM's On Demand Strategy is hard to pin down specifically, it is a horizontal branding applied across a lot of the IBM silos. A search on the ibm.com site returns over 100,000 hits alone. Also see Google Search: http://www.google.com.au/search?q=ibm+%22on+demand%22
[6] M. Darcy, IBM Opens Africa's First "Cloud Computing" Center, Second Cloud Center in China, http://www-03.ibm.com/press/us/en/pressrelease/24508.wss
[7] To an extent this is absolutely true, but it isn't really true in practice. Standards under most SOA definitions are key, and Web Services are absolutely the cornerstone of this.
[8] However, just because your infrastructure scales, this doesn't mean your specific solution on that infrastructure does.
[9] P. Wallis, "A Brief History of Cloud Computing: Is the Cloud There Yet?", http://cloudcomputing.sys-con.com/read/581838.htm.

Sunday, July 13, 2008

Map Reduce and Parallelism

Map-Reduce is a powerful mechanism for parallel execution. Originally a set of semantics used in functional languages, map-reduce is now heavily employed in processing information in a variety of clustered environments. It's used by Google a lot, and Google have their own framework for handling map-reduce [1].

As would be evident, map-reduce is made up of two key concepts:

1. Map takes a list of data and applies an operation or transformation to each element of the list to produce another list of results. The general implication here is that the results list is of the same magnitude as the source list. For example, if you had a list of 1,000 numbers, after the map you'd have another list of 1,000 elements (be they numbers or not).
See: http://en.wikipedia.org/wiki/Map_(higher-order_function)

2. Reduce takes the list of results and compiles them in some fashion. Unlike Map, there is some kind of expectation of "reduction" or derivation of the data - that is if you had a list of 1,000, the result might be a list of 100, or a single number [2]. So a reduce could be anything from summing all the elements, to sorting them, or cutting them back to the top 100 - or any variant therein.
See: http://en.wikipedia.org/wiki/Fold_(higher-order_function)
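
In a language with these primitives the two concepts are directly available - here is a minimal Python illustration (the data is made up, and Python is used purely because it is compact):

from functools import reduce

numbers = [4, 8, 15, 16, 23, 42]

# Map: apply an operation to every element - the result has the same length as the source.
squares = list(map(lambda n: n * n, numbers))    # [16, 64, 225, 256, 529, 1764]

# Reduce (a fold): compile the list down to some derived result - here a single sum.
total = reduce(lambda a, b: a + b, squares)      # 2854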

Google list a number of advantages to the use of their map-reduce framework [1]:
- Automatic parallelisation and distribution
- Fault-tolerance
- I/O scheduling
- Status and monitoring

Most of these are operational benefits. However, the real benefit to Google in using map-reduce is within the first statement, the automatic parallelisation and distribution of processing. This is particularly important when processing very large data sets and/or responding to a user request - a user will only wait so long for a response. A user isn't going to wait ten minutes for a web search to return. So Google run the same search spread across thousands of machines, giving you a response in seconds (in reality, sub-second).

A fairly accessible example of a map-reduce operation is this kind of search. In this case, Map would take a function, such as "Score" and apply it to a large list of data - a large list of webpages. This score is representative of how well the webpage matches some criteria, such as search terms.

Reduce takes one or more lists of scores and performs a "fold" or reduce. In this example, it would do something like take the list of scores and cut it back to the top 100, sorted from highest to lowest. This example of a reduce always produces 100 or fewer scores - give it 50 and it will produce 50 results (sorted), give it 1,000 and it will produce a list of the top 100 results (again, sorted).

For example, if I'm searching for "Jodoro", the Map looks at the 1,000 pages available and using the scoring operation it gives each page a score for the occurrence of "Jodoro" on the page. Reduce then looks at these 1,000 scores and whittles it back to the top 100. The pseudocode for this might look like:


Define MapReduce(Pages, Score, Reduce)
    Scores = Map(Pages, Score)
    Result = Reduce(Scores)
    Return Result
End

Define Map(Pages, Score)
    Create a new list Scores
    For each Page in Pages
        PageScore = Score(Page)
        Add PageScore to Scores
    Next
    Return Scores
End
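
A minimal runnable equivalent in Python (the pages, scoring rule and cut-off are all made up for illustration) might look like this:

# A toy, sequential version of the search example.
pages = [
    "jodoro is a blog about jodoro",
    "an unrelated page about gardening",
    "jodoro jodoro jodoro",
]

def score(page, term="jodoro"):
    # Score a page by how often the search term appears on it.
    return page.split().count(term)

def top_n(scores, n=100):
    # Reduce: keep only the best n scores, highest first.
    return sorted(scores, reverse=True)[:n]

def map_reduce(pages, score, reduce_fn):
    scores = [score(page) for page in pages]    # the Map step
    return reduce_fn(scores)                    # the Reduce step

print(map_reduce(pages, score, top_n))          # [3, 2, 0]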


This is all relatively pedestrian, but the power of Map-Reduce is that these can be performed in parallel. To illustrate this, here is a high level piece of pseudo-code:


Define MapReduce(Pages, Score, Reduce)
    Split Pages into 4 Lists, producing FirstPages, SecondPages, ThirdPages, FourthPages

    FirstScores = Map(FirstPages, Score)
    SecondScores = Map(SecondPages, Score)
    ThirdScores = Map(ThirdPages, Score)
    FourthScores = Map(FourthPages, Score)

    FinalScore = Reduce(FirstScores, SecondScores, ThirdScores, FourthScores)

    Return FinalScore
End


This map-reduce operation splits the scoring process into four pieces that are individually mapped, then ranked using the reduce function. So if you had 400 pages, you would end up with four lists of 100 scores (FirstScores through FourthScores). The Reduce function takes these four lists and produces a list of the top 100.
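
To make the parallel split concrete, here is a sketch using Python's multiprocessing module - the four-way partitioning mirrors the pseudocode above, and the pages and scoring rule are invented for illustration:

from multiprocessing import Pool

def score(page):
    # Stand-in scoring function - each call is independent of every other call,
    # which is what allows the Map step to be distributed.
    return page.count("jodoro")

def score_chunk(chunk):
    # Map one partition of the pages to a list of scores.
    return [score(page) for page in chunk]

def top_n(score_lists, n=100):
    # Reduce: merge the partial score lists and keep the best n.
    merged = [s for scores in score_lists for s in scores]
    return sorted(merged, reverse=True)[:n]

if __name__ == "__main__":
    pages = ["jodoro " * i for i in range(400)]          # 400 made-up pages

    # Split into four partitions and Map each one in its own process.
    chunks = [pages[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        score_lists = pool.map(score_chunk, chunks)

    print(top_n(score_lists)[:5])                        # the five best scores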

It's probably worth pointing out at this stage that this is a complete search. We've not used heuristics, we've examined every page and determined the "best" result. We could have cheated. For example if a page scored very low, we might not have bothered to include it in the scored list [3]. Whilst these heuristics are useful, they are not guaranteed. However, in specific cases such as search, it's probably appropriate to discard very low results.

In the case of this example, another variant might be:


...
FirstScores = Score(FirstPages)
SecondScores = Score(SecondPages)
Scores1 = Reduce(FirstScores, SecondScores)

ThirdScores = Score(ThirdPages)
FourthScores = Score(FourthPages)
Scores2 = Reduce(ThirdScores, FourthScores)

FinalScore = Reduce(Scores1, Scores2)
...


In this case, we Reduce three times. First we take the first two results and combine them, then the second two - and finally we take the two aggregates to produce a final result.
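
The staged approach only works because this kind of Reduce can be applied to partial results and then re-applied to combine them. A small Python check of that property, using a made-up top-three Reduce:

def top_n(*score_lists, n=3):
    # Reduce: merge any number of score lists and keep the best n.
    merged = [s for scores in score_lists for s in scores]
    return sorted(merged, reverse=True)[:n]

first, second, third, fourth = [5, 1], [9, 2], [4, 7], [8, 6]

# Reduce once over everything...
single_pass = top_n(first, second, third, fourth)

# ...or Reduce in stages, combining partial results.
staged = top_n(top_n(first, second), top_n(third, fourth))

print(single_pass, staged)   # [9, 8, 7] [9, 8, 7] - the same, because Reduce has no side-effects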

So what does this all mean? Well, the key is that the mapping piece can occur entirely in parallel - or to be more correct, the scoring can occur in parallel. All of the four "Score" pieces above could occur independently. So you could run the four Score pieces on four different systems in order to process them faster. Google MapReduce or the open source Hadoop framework [4] take this parallelism a step further. Instead of breaking the Pages into four (or whatever) arbitrary pieces, they create a large pool of servers - each with their own set of Pages [5]. When you invoke a map-reduce, a Master server asks all of the servers to undertake the individual map on their particular data, producing the score or otherwise derived result. These scores are then "reduced" to the actual list of results [6]. So the split in the example pseudocode is actually achieved by partitioning the data across the multiple systems.

All this seems pretty straightforward and useful. However, there is a sting in the tail. What isn't made clear by this explanation is a key underlying premise - that the individual score operations are independent. If your score operation was somehow dependent or interlocked with another score operation, then clearly they cannot operate in parallel. This is almost certainly not the case when matching a set of search terms against a webpage - as the search example shows, a score can occur on a webpage on one system completely independently of the scoring that is occurring on another physically distinct system.

It's probably also worth noting that the same could be said for Reduce - in our examples we showed Reduce being used either once or three times. The presumption is that the Reduce doesn't modify the underlying scores and that there are no other side-effects. If it did, the multiple uses of Reduce may also produce an unexpected result.

Joel Spolsky validly points out that an understanding of functional languages assists with understanding map-reduce [7] as map-reduce fundamentally came from semantics in languages such as Lisp. This is largely true, but the semantics of map-reduce do not necessarily imply parallel execution. Certainly the first functional implementations of map and reduce weren't intended for parallel operation at all.

So adding the right semantics to a language to give you map-reduce doesn't necessarily make the code parallel, although it does open up the possibility. In our search example, the structure of the data (Pages) and the independent operation of the mapping operation (Score) are absolutely fundamental to having the map-reduce operate in parallel. The use of map-reduce primarily provides a metaphor where these work together in a logical, consistent way.

If you had a piece of Java code do your scoring, then it's totally feasible that you don't know anything about the scoring implementation (as with a functional language, that's part of the point). However, you'd need to be careful that this piece of code isn't dependent on some kind of resource - it could be manipulating private variables, writing to disk, generating random numbers, or (god forbid) relying on some kind of Singleton instance. For example, your Score may initially generate a random number that is used in all the subsequent scoring operations. If you spread this across four machines you might have the unexpected side-effect that they are operating using a different random number. Whatever the technicalities, in some circumstances it may be very difficult to establish how independent an arbitrary Score implementation is. This is somewhat of an irony - as this kind of abstraction (i.e. an arbitrary score function) is key to a functional language.
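
As a contrived Python sketch of that random-number trap (the scoring function here is hypothetical), hidden state that is harmless on one machine becomes a divergence once the work is split across several:

import random

def make_scorer():
    # Hidden dependency: a random weight is drawn once and then reused by every
    # subsequent call to score(). On a single machine, all pages share this weight.
    weight = random.random()
    def score(page):
        return weight * page.count("jodoro")
    return score

# One machine: every page is scored against the same weight, so scores are comparable.
scorer = make_scorer()
print([scorer(p) for p in ["jodoro", "jodoro jodoro"]])

# Four "machines": each partition builds its own scorer, draws a *different* weight,
# and the combined results are no longer comparable with each other.
partition_scorers = [make_scorer() for machine in range(4)]
print([s("jodoro") for s in partition_scorers])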

As another example, something like the patented Google PageRank [8], relies on the linkages between pages in order to score their importance. What is important here is that the score of a page is derived from the importance of the pages that link to it. In this case, you'd need to be careful about how you built your Score function (to Rank the pages) as the score of an individual page is dependent on the scores of others [9].
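
To see why that dependency is awkward for a straight Map, here is a toy rank calculation in Python - it is not Google's algorithm, just an illustration that each pass needs the previous pass's scores for every other page, so one page can't be scored in isolation:

# Toy, simplified rank calculation: a page's score is derived from the scores of the
# pages that link to it.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}    # page -> pages it links to
scores = {page: 1.0 for page in links}

for _ in range(20):                                   # iterate until the scores settle
    new_scores = {page: 0.0 for page in links}
    for page, outgoing in links.items():
        share = scores[page] / len(outgoing)
        for target in outgoing:
            new_scores[target] += share               # needs *other* pages' current scores
    scores = new_scores

print(scores)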

So - it isn't the map-reduce that is operating in parallel. It's the means of definition of the source data and the transformation that is the first enabler to parallel operation. You can't necessarily take map-reduce and apply it wholesale to do an existing complex financial risk analysis, for example. There are likely complex algorithmic dependencies that simply cannot be accommodated - certainly at least without recasting the problem or data in a manner that suits the model. A solution to the problem might actually require a change in the data definition, or might require the design of several map-reduce operations. Or simply, the problem might have so many implicit algorithmic dependencies that map-reduce isn't appropriate at all.

Fundamentally whilst map-reduce is useful at increasing the level of parallelism in your code, it doesn't intrinsically give you parallel processing. What it does give is a common metaphor for this type of processing, enabling the development of a framework - and all the benefits a framework brings. If you need to do parallel processing, you still need to focus on the composition of the problem.

jon@jodoro.com
...


[1] Jeffrey Dean and Sanjay Ghemawat, http://labs.google.com/papers/mapreduce.html, December 2004
[2] For something like a sort it could return the same number of elements. I can't think of a valid example where a Reduce would *increase* the number of elements. So reduce is probably a pretty apt term.
[3] Google would almost certainly do this with a Web search.
[4] See http://hadoop.apache.org/
[5] Reality is that they will have a lot of redundant nodes in order to scale, but this is the basic premise.
[6] In the case of a Google Web Search, the reduce is to reduce the set of results to the top 1,000 results. You never get more than 1,000: For example, http://www.google.com/search?q=the&start=990
[7] Joel Spolsky 'Can Your Programming Language Do This?', http://www.joelonsoftware.com/items/2006/08/01.html, August 2006
[8] See http://www.google.com/corporate/tech.html
[9] Clearly Google has approaches to solve this dependency, but it is useful in illustrating the point.