Tuesday, November 25, 2008

Corporate OpenID

We'd like to use OpenID in the solution we're developing.

There are currently a large number of OpenID providers, including AOL, the BBC, Google and Yahoo! ... and a lot of individuals will already have accounts with one of these providers. However, we'd also like to offer the same experience to corporate users.

The solution to this seems pretty obvious - corporates should become OpenID providers.

It should also be possible to do this in quite a secure fashion. When an individual uses OpenID to log in to a site, a few key things happen:

  1. The site contacts your OpenID provider and (a) works out the login location and (b) establishes a secret key for this session.

  2. The user is then redirected to the login location to enter their username and password. This is a page belonging to the provider (e.g. GMail), not the site you're actually trying to access. As such, your password is kept between you and the provider alone.

  3. If you're successful, you're sent back to the site. The key is used to verify that you've signed in with the provider.

There is clearly a bit more to it than that - you can opt in and out of various things for example - but that's the basic dialog.
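The secret key from step 1 is what makes step 3 verifiable. As a rough sketch only - this glosses over the OpenID spec's exact key-value encoding and association handshake, and the field string and class names here are invented - the relying party checks an HMAC over the fields the provider signed:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.util.Base64;

public class OpenIdSigCheck {
    // Hypothetical helper: checks that the signature returned alongside the
    // user matches an HMAC-SHA1 over the signed fields, keyed with the
    // secret established during association (step 1).
    static boolean verify(byte[] sharedSecret, String signedFields, String signature)
            throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(sharedSecret, "HmacSHA1"));
        byte[] expected = mac.doFinal(signedFields.getBytes("UTF-8"));
        return Base64.getEncoder().encodeToString(expected).equals(signature);
    }

    public static void main(String[] args) throws Exception {
        byte[] secret = "not-a-real-secret".getBytes("UTF-8");
        String fields = "mode:id_res\nidentity:https://id.example.com/alice\n";
        // The provider would normally compute this; we compute it ourselves
        // here just to show a successful round trip.
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(secret, "HmacSHA1"));
        String sig = Base64.getEncoder().encodeToString(mac.doFinal(fields.getBytes("UTF-8")));
        System.out.println(verify(secret, fields, sig)); // prints "true"
    }
}
```

The real protocol also supports the case where no association exists, via a direct "check_authentication" request back to the provider - but the association-based check above is the common path.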

There is nothing stopping corporates becoming OpenID providers for themselves. To achieve this, they would put a system in their DMZ to interact with relying parties (i.e. the sites using the OpenID).

The "sign-in" page could be established at a location internal to their network (this is not necessary and is perhaps limiting, but it would increase security). As such, when you hit an external website, you'd be redirected to an internal site to actually log in. The employee's username and password never actually leave the internal network, encrypted or otherwise. This login process would also be (more) difficult to spoof or phish, and remain quite resistant to a lot of DNS attacks.

Even better, such a solution could be single signon (SSO) - using the employee's login session at their workstation, rather than requiring them to enter their username and password again.

Sun already does something similar to all of this - it provides OpenIDs to its employees via OpenID at Work. From what I can tell, though, this is a separate identity rather than one linked to internal corporate accounts.

Microsoft is a big supporter of OpenID - for example, Windows Live will support OpenID. However, I can't find any specific literature regarding a turn-key "turn your Active Directory into an OpenID provider" solution. As many corporates rely on Active Directory, this kind of solution would be a rapid enabler.

If anyone knows of solutions or initiatives looking into this, drop me a line.



Friday, September 19, 2008

Burn your CPU Cycles

Doug and I are working on a few new technical approaches at the moment, aiming to meet some stringent functional targets we've set for ourselves [1]. This has led to some curious insights, especially around how our approach is affected by some ingrained prejudices.

Whilst I'm hardly near being tagged with the elder-statesman euphemism, I must admit that I've got an antiquated aversion to certain types of waste - wasted CPU cycles and wasted memory [2]. This is perhaps a side-effect of coming into technology via embedded programming. Either way, even after years of working in and around resource-hungry Java environments, I've never shaken the feeling that such waste is somehow lazy or wrong.

Paul Graham distinguishes good from bad waste in his essay The Hundred-Year Language, arguing for the beneficial tradeoff of giving up some performance for elegance. We certainly subscribe to this. However, it's sometimes hard to separate the good from the bad [3]. For me, a good example is batch versus online (realtime) processing. If you've ever worked in technology at a large organisation, you've probably seen your share of batch processing.

Now, there are good reasons for batch processes. However, in my experience, they're more often a form of institutional laziness. Online integration is complex, harder to test and usually much more expensive to develop (upfront cost). A simple batch system is really simple. On the other hand, batch systems can become operational nightmares (ongoing cost). Compounding this, once you get beyond a certain threshold the complexity of batch processes goes through the roof [4, 5]. You can end up processing a lot of data that hasn't changed and simply didn't need processing. Even so, organisations still plough on with batch systems.

However, there is another side to all of this that is sometimes overlooked - your batch process is probably leading to a poor end user experience.

If you've ever requested something and been told "Sorry, it won't be processed until the next business day", then there is probably a batch system somewhere to blame. At one organisation I was at, you could change your address online (great!), but it took three days for the change to actually trickle down to all the necessary downstream systems. You can probably guess that this regularly led to some unhappy customers.

In working towards these targets, Doug and I have come up with a few ideas that will be processing intensive. The instant instinct was always to think "no way, that'll use a whole lot of resources", usually followed by "... what a waste". However, is it really a waste? I've come to the conclusion that this is not the right way to look at the problem.

This type of mentality locks you into giving users what they have today. You're defining parameters around the user's experience that align with some (subconscious) mental model of a level of resource consumption. The fact is that the next generation of successful applications will be the ones giving users more... And to give them more, you're going to have to crank up the output.

In my view, a lot of developers are overly fixated with performance when they should have a focus on scale. Performance is important, but scale is just fundamental. If you can give your users a better experience, and your application can scale out - then it's easy to see that you're probably on a winner [6].

If you want to add that Web 2.0 tagging system to your widget, then you're going to have to use more resources. Working on a better search? Well, I think it's a safe bet that it's going to use a lot more resource. I recall that when I first saw YouTube my immediate reaction was "they're crazy - this is going to crash and burn - they're just going to waste a whole lot of money on bandwidth". My fixation with the bandwidth consumption blinded me to the fact that it was giving people something they really wanted [7].

So we have a new rule here - the computer works for the user. If we hit an example where we're using more resources, but improving the end user experience as a result, we take that as a sign that we're on the right track. We even push it where we can, looking for opportunities to use more resources and asking ourselves if it's improving the outcome for the user.

... Then the challenge for us is to make it scale.


[1] Partially to give ourselves a stretch target to broaden our thinking, but primarily to attain a target user experience.
[2] This reminds me of my Grandmother and her Post-WWII frugality. She couldn't tolerate waste of any kind. Much to her annoyance, I could never like Bubble and Squeak.
[3] If you've ever worked at a project-driven organisation you'll be aware of the bias to run a project cheaply, usually ignoring the higher long term cost.
[4] When I was at a large UK bank the batch used to run overnight - it was a pretty busy batch window - after everything ran they only had 14 minutes to spare, so god help them if something went wrong. These same batch windows were also an impediment to doing simple things - like full 24/7 online banking, or opening branches on a Saturday.
[5] I also recall some developers who used to maintain a large batch operation. They started out using Microsoft Project to control all the windows and the dependencies (using the leveling feature). Eventually it got so complex they had to write a bespoke tool to maintain the schedules - let alone the complexity of actually monitoring and maintaining it.
[6] Naturally, there is a cost-benefit ratio coming in there somewhere. If you've ever needed to fix an application that doesn't scale, as opposed to one that is just plain resource hungry then you'll know there is a world of difference.
[7] I had a similarly incorrect first reaction to Flickr.

Thursday, September 11, 2008

Patents explained for Start Ups

A couple of weeks back Jon and I visited an Intellectual Property Lawyer to discuss the steps required to take out some patents on our technology concepts. Apart from the token IP law education they squeezed into my degree all those years back, I had never ventured into this side of business before, so I found the process pretty enlightening. [1]

If you are seriously considering patenting, do find a patent attorney and have a chat (yes, specifically patent attorney as opposed to intellectual property lawyer). I say this both as a disclaimer (IANAL), but much more than that, it's simply a good idea. Patent and IP Lawyers are generally experienced, clever, well-connected people and can offer good advice and referrals on a lot of topics. [2]

The first consultation is usually free (in fact let the alarm bells ring if they try to charge you). Payment generally only starts when you decide to go ahead with the process. [3]

The first thing our lawyer asked was if we were ethically opposed to patents, especially software patents. Irrespective of the answer, the advice was still to consider applying – if for no other reason than to provide protection against aggressive competitors or "patent trolls"[4]. One could, I guess, argue that this is what any lawyer is going to tell you. Even so, anyone in the industry would probably be aware that it still makes a lot of sense. He then proceeded to walk us through the different stages of registering a patent, and the different options we had – including when and roughly how much money we would need to fork over.

The basic premise of a patent is that you are disclosing your invention to the state in exchange for exclusive rights. [5] To do this you need to demonstrate that you have innovated in your chosen field beyond what would be considered a normal logical step.

Simply having your patent submission accepted doesn’t necessarily mean that you have exclusive rights to your invention. It can still be rejected by the courts, for two reasons:

  1. Someone else may have lodged the same or a similar invention before you. The patent approvers do not trawl through all of the existing patents to determine if you are the first; that is your responsibility - and that of your competitors. Google has recently launched Google Patents, a searchable data store of all United States patents.

  2. Ultimately the courts may find the patent unenforceable. Patents are actually written as a series of claims. The first couple are typically very broad, sweeping claims that courts are unlikely to uphold; the claims then narrow in on the truly innovative parts of your patent, becoming more and more specific. This is done in the hope that the higher claims will be upheld, whilst still providing fallback protection if they are not.

Whilst the term ‘international patent’ occasionally pops up, in reality it is not possible to establish a ‘world wide’ patent. Patents are granted on a per-territory basis. Not only that, the rules as to what can and can’t be patented, how the patent itself needs to be presented, and what exclusive rights you can derive from the patent differ from territory to territory.

Much of the fundamental basis of the patent system is similar in many territories. Clearly, the broad aims of patent law in the United States and the European Union are very similar. However, even when the basis of the patent law is very similar, there can often be a lot of variance in the interpretation and bias of a particular system.

In the United States, for example, patent applicants can typically secure a patent by demonstrating that their application is inventive. By contrast, in the European Union there is a stronger focus on demonstrating that the patent solves a real-world problem. Australia has traditionally followed the US system, but is now tending to adopt the approaches of the EU. There are other differences too: in the EU the significant date in resolving conflicts is the date of submission, while in the US it is the date of invention.

Ultimately, if you need international patent protection - and many Internet-centric technologies do - you have no choice but to individually apply for patents in each country in which you wish to secure exclusive rights to your invention. Each country has a differing cost associated with it (it varies considerably, but is circa several thousand US dollars [6]).

Much of the expense is to have a patent attorney draw up the draft patent to specifically meet that country’s requirements, and to pay for an official to prosecute the patent. However, you might incur other costs too, such as requiring official translations of your documents.

There are some options though on how you go about this, and those options have differing time periods, and differing cost structures.

You can simply apply for individual patents in the specific countries you choose. This is the cheaper option if you only wish to secure patents in a few countries. Following this path you would typically look at a 12-month period before the patent is prosecuted and your payments are due.

There is also an option to follow a path that uses the process defined in the Patent Cooperation Treaty. [7]
This isn't an ‘international patent’, but a normalised process for applying for patents in multiple countries. Amongst other things, it allows you to set one priority date (date of filing) across all the participating countries. [8]

This PCT process is initially more expensive (circa US$5,000 - 10,000). However, the normalised process pushes out the need for individual country patent prosecution decisions and payments by a further 18 months.

So whilst there may be a larger up-front cost, you actually buy yourself some time. If you're a start-up, this can be invaluable: it allows you to obtain a level of IP protection across all PCT countries without going through the expensive process of prosecuting these patents, and that time could prove crucial for securing Venture Capital.

However, there are also other benefits. The international patent reviewers will highlight the common problems patents hit in the differing countries. This means that when the patents are sent to the individual countries to be prosecuted, it is far less likely you will see rejections, potentially saving you money on resubmissions. The international patent reviewers can also advise that you are unlikely to be able to secure patents in certain territories, saving you the cost of those submissions. The other advantage is that it provides another point in the process at which you can make modifications to your patent, which may help if the specific real-world problem you are solving has changed slightly since your original submission.


[1] Jon's got a little more experience - he filed for a patent back in 2000.
[2] For example, your company formation might be pretty key – especially in an IP context.
[3] Plus, if they find your idea promising, they’ll be jumping at the chance to do a new IT start-up some early favours.
[4] See http://www.paulgraham.com/softwarepatents.html
[5] For more details on the definition of a patent: http://en.wikipedia.org/wiki/Patent
[6] Japan is apparently one of the most expensive states to secure a patent in, primarily due to the increased costs of compulsory official translation.
[7] http://en.wikipedia.org/wiki/Patent_Cooperation_Treaty
[8] The vast majority of large economies are part of the PCT ( http://www.bpmlegal.com/pctco.html). There is a list of non-members here ( http://www.bpmlegal.com/notpct.html).

Sunday, September 7, 2008

Genealogy the Ontological Problem

Genealogy is a contrast. It is a very personal experience and endeavour - the majority of people are interested in their own family, their own ties and their own history. Ultimately, however, this history is shared. The further back individuals investigate their ancestry, the broader the common ground becomes.

We are also interested in genealogy - because it's both a fascinating space and one that exercises a lot of interesting problems.

There have been a number of sites dedicated to genealogy for some time, but the majority of these are forums in which people collaborate (e.g. "Anyone know of a John Browne in Manchester, England, sometime around 1855?"). There are also emerging sites that let you build family trees, but these are generally private trees, or limited to collaboration within family groups. TechCrunch recently reported that a "war" is developing in the genealogy space [1].

Speaking very generally, a family tree is a nodal representation of relationships between individuals. The emphasis, naturally, is on genetic ties. The key relationships are marriage (or, more correctly, partnership) and parentage. These ties link the individuals in the family tree.

This is relatively simple so far. However, there are more complex relationships, even within this simple base set. For example, you can infer "great" relationships (ancestry), and the number of these grows exponentially with each "great" you add. There are sibling relationships and other more specialised scenarios - such as half-siblings, step-siblings, twins, or adoption. In modern times there is also the possibility of same-sex marriage, surrogate pregnancy, or sperm and egg donation. There are also other cases, which can be skeletons in a family tree - multiple marriages, incest, adultery. You wouldn't need to go far back in many family histories to find someone who vanished (and perhaps had another family unknown to both sides), was disowned or simply forgotten.

These can all be accommodated in most traditional data models. However, the real complexity is that family trees are still personal and can disagree. This may be as simple as a dispute over some basic factual information, such as a name (e.g. Doug vs Douglas, or Smith vs Smithe). It is considerably more complex when there are structural differences, such as disagreements over a parentage, or an entire lineage.

This is hard to handle using traditional data models. A lot of approaches take the tack of a "conflict resolution" mode - much like source control. However, this is inadequate: a lot of these conflicts will never be resolved. Someone's aunt may never agree with such-and-such's uncle. You could simply replicate all the information in each family tree, but you'd be creating a lot of redundant data and (severely) limiting the utility of this information. This approach devalues the power of the information in the cases where people do agree.
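One (purely illustrative) alternative to forcing conflicts to resolve is to stop storing a single canonical fact at all, and instead record every claim alongside the tree it came from. All names and classes below are invented for the sketch:

```java
import java.util.*;

// Sketch: disagreement as a first-class citizen. Each "fact" about a person
// is an assertion attributed to a source, and the graph keeps all of them
// rather than forcing one version to win.
public class FamilyGraph {
    record Assertion(String subject, String predicate, String value, String source) {}

    private final List<Assertion> assertions = new ArrayList<>();

    public void assertFact(String subject, String predicate, String value, String source) {
        assertions.add(new Assertion(subject, predicate, value, source));
    }

    // All claimed values for a fact, grouped as value -> supporting sources.
    public Map<String, List<String>> claims(String subject, String predicate) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Assertion a : assertions)
            if (a.subject().equals(subject) && a.predicate().equals(predicate))
                out.computeIfAbsent(a.value(), v -> new ArrayList<>()).add(a.source());
        return out;
    }

    public static void main(String[] args) {
        FamilyGraph g = new FamilyGraph();
        g.assertFact("p1", "name", "Doug", "aunt-tree");
        g.assertFact("p1", "name", "Douglas", "uncle-tree");
        g.assertFact("p1", "name", "Doug", "census-1901");
        System.out.println(g.claims("p1", "name"));
        // {Doug=[aunt-tree, census-1901], Douglas=[uncle-tree]}
    }
}
```

The point of the sketch is that agreement falls out for free - where sources concur, the supporting list simply grows - while disagreement is preserved rather than papered over.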

Combining this information in a single repository requires a functional and data model that is exceptionally flexible. It's clear we're approaching what is often called the "ontology problem" [2]. Ontologies and taxonomies are key to many (if not all) information domains, and absolutely fundamental to modern information systems.

If you are managing any kind of knowledge, getting the ontology right is pretty important. If you've ever tried to classify something or put it into a hierarchy, then it's likely you've hit this complication. Object-Oriented development certainly falls into this space. For example, I might have a Van, classified as a Vehicle - but what happens when I have a Motorhome? Or an Amphibious Vehicle? Or an Amphibious Motorhome? If I'm working in a bookstore, do I put a book under fantasy, crime or literature? It might fit in all three.

In these cases, there is no correct answer. You end up with multiple classifications, all of which are correct. Just like genealogy, it depends on the context. The problem with ontologies is that they can be extremely difficult to define, and like family trees, they are complex, recursive, inter-dependent beasts.
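The Van/Motorhome example maps directly onto the single-inheritance limitation in Object-Oriented code. A hedged sketch of the usual escape hatch - classifying by independent capabilities rather than by one parent class (all type names invented):

```java
// Sketch: in a single-parent hierarchy an AmphibiousMotorhome has no one
// correct home. Modelling the classifications as independent capabilities
// (interfaces) lets one thing carry several classifications at once.
interface Vehicle {}
interface Watercraft {}
interface Dwelling {}

class Van implements Vehicle {}
class Motorhome implements Vehicle, Dwelling {}
class AmphibiousMotorhome implements Vehicle, Watercraft, Dwelling {}

public class Classification {
    public static void main(String[] args) {
        Object o = new AmphibiousMotorhome();
        // One object, three correct classifications.
        System.out.println(o instanceof Vehicle);    // true
        System.out.println(o instanceof Watercraft); // true
        System.out.println(o instanceof Dwelling);   // true
    }
}
```

Even this only defers the problem - someone still has to agree on what "Watercraft" means - which is exactly the ontological difficulty the paragraph above describes.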

When you look at the substance of the Semantic Web, ontologies and taxonomies are absolutely key. You can't semantically link things together unless both ends agree on this key ontological information [3]. It would be impossible to search for "Motorhomes" on the Semantic Web if different sources classify this information in completely different ways. A given classification might work in some contexts, but not others. You might end up with islands of information that aren't compatible and cannot be interconnected - the exact opposite of what the Semantic Web is trying to achieve.

This is why we see genealogy as a generalisable problem. Crack some problems in the genealogy space and you might be solving some fundamentals for the Semantic Web - and vice versa.

jon@jodoro.com and doug@jodoro.com

[1] See http://www.techcrunch.com/2008/09/03/genis-quest-toward-one-world-family-tree/ and http://www.techcrunch.com/2008/09/06/family-tree-wars-continue-myheritage-raises-big-round-shows-impressive-growth/.
[2] It might be worthwhile looking at the Google search for this term, http://www.google.com/search?q=ontology+problem.
[3] See http://novaspivack.typepad.com/nova_spivacks_weblog/2004/11/the_ontology_pr.html

Thursday, September 4, 2008

Working towards a Vertical Language

As part of our concept work here at Jodoro, we are working on a computing language, which we internally call Teme.

Anecdotally, there appear to be a lot of languages. My memory is probably a bit sketchy by now, but I recall from my university lectures that there were circa 100 viable high-level languages in the late 1970s. Today the figure would be in the 1,000s, depending on how you'd classify a viable language. In my own work I'm increasingly aware of (and somewhat overwhelmed by) the number of language and technology options that are available [1].

Personally, I think the increase in computing languages in the recent era boils down to a few factors:

  1. (Ever) increasing availability of processing power makes dynamic languages more accessible, and is increasing the (broader) interest and activity in this domain.

  2. Environments such as Eclipse make it very easy to create basic tooling for a new language; a lack of adequate tooling is a common impediment to a language gaining adoption. I could also point at other key enablers, such as Apache Ant or the LLVM project.

  3. I'm sure some pundits will disagree, but I believe the dominance of languages such as Java and C# in corporate environments has allowed niche areas to develop.

  4. The Open Source movement has greatly enriched the pool of libraries and useful code that is available. This can give languages the "shoulders of giants" kickstart they need. There are innumerable examples, but LLVM is again a great one.

  5. We do more computing nowadays - and the computing we do is more novel.

... I think that some observers can underestimate how significant the last point is. As the computing world evolves, languages are changing in their focus and capability. A computing language is, first and foremost, a means for a human to tell a computer what to do. Most significantly, what we are asking computers to do is broadening day by day.

However, all of this raises the question: what is a language, exactly? [2] Any significant software solution will eventually result in what I'd describe as some form or subset of a language. A large SAP or Siebel installation may nominally run on a Java platform, but you'll need to understand the semantics of SAP or Siebel to actually develop anything. It's possible to argue that many developers build Domain Specific Languages into their solutions without really realising it [3].
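As a toy illustration of that point, even a mundane fluent API is really a small language in disguise. Everything below is invented for the example:

```java
// Sketch: a fluent "query" builder. The Java syntax is incidental - the real
// interface is the vocabulary and ordering the builder enforces, which is a
// miniature domain-specific language whether or not its author calls it one.
public class Query {
    private final StringBuilder text = new StringBuilder();

    public static Query select(String fields) {
        Query q = new Query();
        q.text.append("SELECT ").append(fields);
        return q;
    }
    public Query from(String table) { text.append(" FROM ").append(table); return this; }
    public Query where(String cond) { text.append(" WHERE ").append(cond); return this; }
    @Override public String toString() { return text.toString(); }

    public static void main(String[] args) {
        String q = Query.select("name").from("employees").where("dept = 'IT'").toString();
        System.out.println(q); // SELECT name FROM employees WHERE dept = 'IT'
    }
}
```

A developer who writes enough of these has designed a grammar and a vocabulary - language design by another name.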

Equally, you might look at Microsoft .NET as a set of languages - however, these languages all share the same underlying class library. Arguably, the "language" of the class library is more significant than the specifics of C#, VB, or any other .NET language.

This (perceived) mayhem of different languages is compounded by a degree of flux among a number of emerging technologies. For example, a lot of institutions are investing in Business Rule Engines [4]. There are a variety of these engines available, each with a different area of specialisation and its own inherent language and tooling. There are also other technologies emerging rapidly in corporate environments - Business Process Execution is another classic example.

With that in mind, you could consider the Google search box a language. I often type "1 USD in AUD" into Google, or "Time in London" (there are dozens more, such as calculations or stock prices). It's a long way from a formal grammar, but it's a new language construct that we're all using every day. The nice thing is that Google has obviously mined and researched these kinds of queries and catered for the common cases. It's a language that evolves in response to the actions of its users.
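To make that concrete, a pattern like "1 USD in AUD" can be recognised with a trivially small grammar. A hedged sketch - the regex and names are invented, and no real exchange rates are involved:

```java
import java.util.regex.*;

// Sketch: recognising the "<amount> <currency> in <currency>" construct.
// Recognition only - a real implementation would dispatch the parsed parts
// to a rates service rather than just echoing them.
public class QueryGrammar {
    private static final Pattern CONVERT =
        Pattern.compile("(\\d+(?:\\.\\d+)?)\\s+([A-Z]{3})\\s+in\\s+([A-Z]{3})");

    public static String parse(String query) {
        Matcher m = CONVERT.matcher(query);
        if (!m.matches()) return "no match";
        return "convert " + m.group(1) + " from " + m.group(2) + " to " + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(parse("1 USD in AUD"));   // convert 1 from USD to AUD
        System.out.println(parse("time in London")); // no match
    }
}
```

Each such pattern Google supports is one more production in an informal, user-driven grammar.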

So why are we developing another one? To accommodate the novel things we want our system to do (novel in our minds at the very least). As Paul Graham points out in his Arc Language FAQ, it would be surprising if we didn't still need to design new languages [5], [6]. If you need to work in a domain that is in any way novel, there is a good chance you need to start thinking about the language constructs required.

There are a number of specific reasons why someone might start developing a language in earnest. For example, there is a lot of buzz around the Semantic Web at the moment, and much of the effort in this area has focused on the SPARQL query language. The development of this language, and other standards, is absolutely fundamental. For us, there were a few drivers in starting to look at our own language:

  • As an intellectual exercise.

  • To explore a model of computing that is highly parallel and distributed.

  • To address a particular problem domain that is unapproachable or clumsy in other languages.

In particular, we are interested in increasing the parallelism and distribution of our systems. This is key in constructing a system that can both (1) process large data sets in "user time" as well as (2) be capable of scaling to meet any increase in user demand. My previous article on Map-Reduce discusses one language construct for parallelism - we're keen to take this further and into other processing domains.

Developing a language is a useful exercise: it lets you put a lens on your work and view it in a different way. If you take your problem and look at it as a language problem, you see the semantics in a new light. Even if you're working in "another language" such as Java or Ruby, it's still useful to think about the constructs you are building as a language and work towards that. This is a key philosophy of languages such as Lisp [7], but it's a fundamentally useful exercise regardless of your target language.

How to go about this is the next question - it's very easy to set up the key axioms for a new language or approach, but the difficulty arises as you inevitably encounter trade-offs along the way. In our effort we're continually walking a line between something ultimately flexible and something that remains interoperable.

In later articles I'll write about how we decided to go about this (for better or worse) and some of the motivations and experiences that drove our efforts. I'll also discuss what we mean by a "Vertical Language" and how that is shaping our end goals.


[1] Eric Lebherz, HyperNews Computer Language List, http://www.hypernews.org/HyperNews/get/computing/lang-list.html, 2005
[2] Trying to define the semantics of the word "language" is perhaps the ultimate tautology, but hopefully you get my point. Wikipedia has an article on Computer Languages that is less philosophical.
[3] This is perhaps expressed more succinctly by Greenspun's Tenth Rule
[4] I've personally come across three Business Rule Engines in the same organisation - ILOG JRules, Fair Isaac Blaze and Experian. All very different rules engines, with their own language definitions.
[5] Paul Graham, Arc Language FAQ, http://www.paulgraham.com/arcfaq.html, Year Unspecified
[6] Paul Graham also has a great essay called The Hundred-Year Language that is well worth reading.
[7] Lisp pundits call Lisp a "Meta-Language" for this very reason.

Wednesday, September 3, 2008

Looping in distributed programming

for (init_expression; loop_condition; loop_expression)
        program statement;

while (loop_condition)
        program statement;

The simple ‘for’ and ‘while’ loops above have so permeated programming languages that it's hard to imagine a programmer who hasn't written them innumerable times - so much so that most programmers would never bother to question their function; they are, after all, intuitive and self-explanatory.[1]

However, as we started to look at parallel distributed execution, we realised the dynamics of these constructs are pivotal. This essay dares to delve deeper into the simple loop construct and asks how to make it perform. In particular I will focus on loop parallelisation.[2]

Historically, the vast majority of loop constructs have been written to execute on a single processor, and that is exactly how the creators of the original construct intended us to use them. However, over the last decade we have seen the emergence of some very large data sets, not the least of these being the Internet. These provide many classes of problems that require us to loop over data sets so large that a single processor, no matter how fast, is simply not going to return a result within tolerable real-time constraints. Simultaneously, processor vendors are moving to an increasingly “multicore” strategy - Intel recently told developers to prepare for “thousands of cores”.[3]

There has historically been a happy marriage between the way humans think about problems and the way computers execute them - both, even if multitasking, essentially boil down to linear execution. As such, a programmer writing an algorithm typically thinks and writes one step after another. But when it comes to distributed computing, suddenly we are asking computers to work in ways many people cannot - at least not alone. Distributed programming expects a single programmer to derive and (somehow) write an algorithm as if an undefined number of machines will be collaboratively executing it.

With this in mind, it becomes abundantly clear that our timeless loop construct is inadequate for distributed computing. We either need a construct that informs the computer network what can be distributed, or we need an interpreter that can determine at run time what can and can't be distributed. Of course simply because something can be distributed doesn't necessarily mean it should – for example if the data set is small the distribution could be all overhead. However, there is no doubt there are many problems for which this isn’t the case.

Let us first look at the different types of loops to help us understand which can and can't be distributed.

Independent Loops

The first I like to call the Independent Loop. Such loops execute the same logic/function over an array of data - it might be, for example, 'scoring' the relevance of a set of data that has been collected. The key to this type of looping is that the only data being modified is the data being looped over (or the corresponding element in an identically sized 'result' dataset). An example of this type of loop is:

int data[] = {5, 4, 1, 6, 3};
for (int i = 0; i < data.length; i++)
        data[i] *= data[i];

The advantage of Independent Looping is that it can be maximally distributed. We could comfortably split the data into any number of arbitrary parts and execute the same multiplication 'for' loop across each part. Concatenating the resulting data sets would leave us with a result identical to executing the whole data set in a single loop.
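As a sketch of what that distribution might look like on a single machine (in modern Java, with a thread pool standing in for a network of machines - the class and method names here are my own invention):

```java
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class IndependentLoop {
    // Square every element, farming arbitrary chunks out to a pool of
    // workers. Because no iteration reads anything outside its own
    // element, any partitioning yields the same result as the
    // sequential loop.
    static void squareInPlace(int[] data, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        int chunk = (data.length + workers - 1) / workers;
        for (int start = 0; start < data.length; start += chunk) {
            final int from = start;
            final int to = Math.min(start + chunk, data.length);
            pool.execute(() -> {
                for (int i = from; i < to; i++)
                    data[i] *= data[i];
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        int[] data = {5, 4, 1, 6, 3};
        squareInPlace(data, 2);
        System.out.println(Arrays.toString(data)); // [25, 16, 1, 36, 9]
    }
}
```

The number of workers is immaterial to the result - that is precisely what makes this class of loop distributable.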

This type of looping is essentially a form, or superset, of Map-Reduce as discussed by Jon in his article on "Map-Reduce and Parallelism", posted in July 2008.

Accumulative Loops

The next loop type I call the Accumulative Loop. An Accumulative Loop is similar to the Independent Loop in that the function is executed on a single dataset one element at a time. However, rather than (or as well as) modifying the dataset element, the algorithm modifies at least one variable outside the loop. What is key in this type of loop is that the sequence of modifications to the external variable isn't important. This is an important distinction: if the outcome is dependent on the order of execution, then the algorithm is not deterministic in a distributed environment.
An example of a use of this kind of loop is to sum the values of a dataset:

int data[] = {5, 4, 1, 6, 3};
long sum = 0;
for (int i = 0; i < data.length; i++)
        sum += data[i];

As with the Independent Loop, this kind of loop can be distributed. Care obviously needs to be taken to ensure the various versions of the externally updated variables are ultimately reunited.
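A sketch of that "reuniting" step, again with threads standing in for separate machines (class name and structure are my own - the essential point is that each worker accumulates into its own copy of the external variable, and the copies are combined at the end):

```java
public class AccumulativeLoop {
    // Sum the data by giving each worker its own partial sum, then
    // reuniting the partials at the end. Addition is order
    // independent, so any partitioning gives the same total.
    static long sum(int[] data, int workers) {
        long[] partial = new long[workers];
        Thread[] threads = new Thread[workers];
        int chunk = (data.length + workers - 1) / workers;
        for (int w = 0; w < workers; w++) {
            final int from = Math.min(w * chunk, data.length);
            final int to = Math.min(from + chunk, data.length);
            final int slot = w;
            threads[w] = new Thread(() -> {
                for (int i = from; i < to; i++)
                    partial[slot] += data[i];
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        // Reunite the per-worker accumulators.
        long total = 0;
        for (long p : partial) total += p;
        return total;
    }

    public static void main(String[] args) {
        int[] data = {5, 4, 1, 6, 3};
        System.out.println(sum(data, 2)); // 19
    }
}
```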

Dependent Loops

The last type of loop I'd like to introduce is the Dependent Loop. This is one where the result of one iteration is dependent on the last, or the outcome of the loop is dependent on the order of execution. The dependency may be data, or control related. A data dependent example is:

int data[] = {5, 4, 1, 6, 3};
for (int i = 1; i < data.length; i++)
        data[i] += data[i-1];
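To see why this loop resists distribution, consider what happens if we naively split it as we did with the Independent Loop - the second partition loses the running value carried out of the first, and the concatenated result is simply wrong (again, the class is my own illustration):

```java
import java.util.Arrays;

public class DependentLoop {
    // The sequential running-sum loop from the text: each element
    // depends on the result of the previous iteration.
    static int[] runningSum(int[] data) {
        int[] out = data.clone();
        for (int i = 1; i < out.length; i++)
            out[i] += out[i - 1];
        return out;
    }

    // Naive "distribution": split the array in two, run the same
    // loop on each half independently, and concatenate.
    static int[] naiveSplit(int[] data, int splitAt) {
        int[] left = runningSum(Arrays.copyOfRange(data, 0, splitAt));
        int[] right = runningSum(Arrays.copyOfRange(data, splitAt, data.length));
        int[] out = new int[data.length];
        System.arraycopy(left, 0, out, 0, left.length);
        System.arraycopy(right, 0, out, left.length, right.length);
        return out;
    }

    public static void main(String[] args) {
        int[] data = {5, 4, 1, 6, 3};
        System.out.println(Arrays.toString(runningSum(data)));    // [5, 9, 10, 16, 19]
        System.out.println(Arrays.toString(naiveSplit(data, 3))); // [5, 9, 10, 6, 9]
    }
}
```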

One such control related dependency may even be the decision whether to execute the next iteration. This is common in while loops - for example, the conventional algorithm for deriving the greatest common divisor of two numbers u and v:

while (v != 0) {
        int temp = u % v;
        u = v;
        v = temp;
}

When the result of the next iteration of the loop relies upon the former we cannot distribute the execution. Instead, if possible, the code should be rewritten to be Independent or Accumulative.
An interesting observation is that the boundary case of infinite looping is a class of Dependent Loop. This is because the Independent and Accumulative Loops iterate over a predefined dataset. The infinite loop is a particularly dangerous boundary case in distributed programming, as if released undetected it could consume all of the available resources. Most applications will have an infinite loop somewhere, generally as a message processing construct (equivalent to the while(true) { } in your Java Runnable) - but these are implicitly dependent, because such loops are generally consuming some kind of resource (i.e. a message queue or similar).

Of interest to me was that all three types of algorithms can be written using 'for' loops (or even 'while' loops), and hence the construct is actually too flexible to be useful in distributed programming. If for example we have:

for (...) { blah.doSomething(); }

It is very difficult to determine whether this loop is Dependent, Accumulative or Independent. This may be even more difficult in a dynamic functional language such as Ruby or Lisp as you might be using function pointers that are unknown at compile time. [4]

Instead in our development we have introduced new constructs in our distributed language that allow the programmer and the computer to cleanly separate these types. The programmer must specify their intention at design time and this significantly simplifies the compiler's task of determining which loops will be distributed and which will not.


[1] The Do loop was first introduced in FORTRAN 66 in 1966. See: http://en.wikipedia.org/wiki/For_loop#Timeline_of_for_loop_in_various_programming_languages
[2] For details on other loop optimisation techniques take a look at http://en.wikipedia.org/wiki/Loop_transformation.
[3] See http://news.cnet.com/8301-13924_3-9981760-64.html
[4] This same issue is referenced in Jon’s Map-Reduce and Parallelism article.

Sunday, July 20, 2008

Defining Cloud Computing

Generally, when a new fad [1] comes along in the software industry there is a fantastic initial rush of enthusiasm, buoyed by the initial success and enthusiasm of the innovators...

...This is generally followed by a series of attempts to define just what the fad is. We've seen this in recent history with topics such as Service Orientated Architecture (SOA) or Web 2.0 - or even things that seem very specific at the outset, like AJAX. There are countless other examples going back further.

This is now the case with the term Cloud Computing. Cloud Computing is difficult to define, but it's very generally about the commoditisation of computing resource. Instead of buying a server, with a specific processing and storage capacity - you simply buy an allocation of capacity from a vendor, a capacity that you can then increase or decrease as necessary. Instead of having a number of discrete resources (e.g. Ten disks on ten different servers), you get access to storage as one, virtual service.

It seems a simple proposition from many perspectives, but in reality the distinction is pretty subtle. After all, buying a physical server in principle achieves the same result - you pay and get more capacity. However, the real difference is in the definition of the service you get, and the mechanics for accessing that service - which is where the definition gets complicated.

Recently on Slashdot (Multiple Experts Try Defining "Cloud Computing"), there was a reference to a SYS-CON article on the definition of Cloud Computing - Twenty Experts Define Cloud Computing [2].

So Cloud Computing is not unique in that it lacks an explicit boundary. Definitions in the SYS-CON article [2] varied quite a bit. Some of the commentary included:
  • ..the broad concept of using the internet to allow people to access technology-enabled services
  • Most computer savvy folks actually have a pretty good idea of what the term "cloud computing" means: outsourced, pay-as-you-go, on-demand, somewhere in the Internet, etc.
  • Clouds are the new Web 2.0. Nice marketing shine on top of existing technology. Remember back when every company threw some ajax on their site and said "Ta da! We’re a web 2.0 company now!"? Same story, new buzz word..

One of the definitions I particularly liked was by Yan Pritzker [3]. This definition makes three key points (refer to the article for the full explanation):
  • Clouds are vast resource pools with on-demand resource allocation.
  • Clouds are virtualized.
  • Clouds tend to be priced like utilities.

All of these definitions are generally from respected pundits and enthusiasts in the area. In the formative (storming) stage of a technology these are often very insightful and useful. However, as the audience and participants grow, this is diluted very quickly. In my experience, the definition of a fad is (eventually) corrupted by these definitions - In particular, as software and hardware vendors inevitably take control of the spotlight.

It's very common in my experience to see vendors come in with the following line - (a) here is this great fad and how it's changing the world, (b) it's obviously complex, so here is a definition of it that you can digest - and (c) here are our products... Coincidentally, our products fit that definition exactly.

This is perhaps cynical. It's perfectly natural for a vendor to define their strategy in line with industry thinking and then produce their product suite to fit. Arguably, IBM know quite a bit about Cloud Computing. IBM has been pushing the "On Demand" brand across their various offerings since 2002 [4]. This is not to say that IBM's On Demand necessarily fits some of the definitions of Cloud Computing [5] - but it's certain that IBM will fold the buzz around Cloud Computing into its strategy around On Demand [6].

However, the cynicism around the action of vendors in this space is not unfounded. There are a lot of products that suddenly become SOA or BPM or AJAX once the term becomes a hot property in the industry. Or you simply take an existing product suite, add another product or extension that fits the mould, and rebrand the lot. This also highlights one good strategy for a startup - Find an existing, established vendor, spot some kind of technology gap or inconsistency, develop against the gap and get acquired.

There is an underlying reason that this definition exercise is possible. To an outside party it may be clear that a new approach or technology has potential. This is driven by the outstanding success of the innovators and early adopters. This success often generates a justified amount of buzz in the industry. So there is potential, and a new way of doing things - and generally organisations have problems to solve. So the question inevitably arises - "This sounds good, but what can it do for me?".

Whilst this generally results in a bewildering number of answers, usually designed to push product, the fact is that often the cornerstones underlying these fads are relatively straightforward. SOA really has its roots in Web Services - and although the two are theoretically independent, they have walked hand in hand through the hype cycle. Every presentation you see will belie this fact; indeed, they will often state the exact opposite: "SOA is not Web Services" [7]. This is somewhat true, but it often leads to a definition that is a tautology.

These snippets usually get entangled for good reason. The fact is, just adopting Web Services (or whatever equivalent) simply isn't the end of the story. In fact, it's often just the first step in thousands. You have Services now, so you need to lifecycle manage them, instrument them, align them to Business Processes to get "true business benefit and reusability", etc, etc. Most of these aren't new problems at all, but the componentisation around the original Service approach (the Web Service) has changed the unit of work, and changed how a lot of moving parts are now expected to interconnect - software, hardware, processes and people. So something like SOA moves in to fill these gaps, which clearly can be nebulous in the extreme.

So what you have is a shift in technology that quickly becomes a bigger beast. Now that you have this technology, you need the commensurate shifts in the methods, tools and outlooks that support this technology. Sometimes these shifts are on the periphery, sometimes they are orders of magnitude greater than the original technology driver. This is where the definitions become nebulous. e.g. "Now that I'm building Services, which ones should I build and how do I maintain them?"

In that sense, Cloud Computing falls into a similar trap. Accepting the fundamental technology premise around Cloud Computing opens up a lot of questions. Cloud Computing may be quite simple technically, but there are a whole raft of other implications that you need to deal with - many of which are old ground. I'm sure Mainframe pundits look at concepts such as SOA and Cloud Computing and say "but we've already solved that". In fact many cynics I speak to in the industry label these initiatives as the industry's desire to get back to the Mainframe - arguable, but you can usually see their point. CORBA pundits are probably part of a similar chorus.

Irrespective, something like Cloud Computing puts these combinations together in unique ways. So whilst you're treading old ground, some of the solutions might not necessarily be the same. For example, one of the keys in Cloud Computing could be achieving a cost basis that drives economies of scale.

So, to contradict myself and provide a definition - here are some of the technology keys that I see as driving Cloud Computing:
  1. Even though it might be transparent to an end user, Cloud Computing is run on "Commodity Hardware". This is a loose definition in itself, but generally means off-the-shelf x86 servers, often running Linux.
  2. Further to this, instead of a handful of Big Iron servers, Cloud Computing tends to rely on a large cluster of these commodity machines.
  3. Cloud Computing relies on increases in bandwidth to allow access to processing and storage resources via the Internet.

What this generally means is that:
  • Your computing services are cheaper because they are using cheaper hardware.
  • Your computing services are cheaper because utilisation is higher - if you're not using your capacity, someone else probably is.
  • It's easier to scale, as you are tapping the bigger resource pool that a single vendor can offer [8].
  • The Quality of Service (e.g. Reliability) is tuned by the inherent capacity and redundancy of the cloud. The systems aren't specified to be ultra-reliable; the resilience of a cloud is driven by having redundant nodes to take over in the event of failure.
  • You were reliant on external parties before (e.g. ISP, Web Host), but you're increasingly reliant on them "higher up the stack".
  • Cloud Computing relies less on references that refer to physical assets (e.g. a fileshare on a specific server), but instead virtualises these for the customer. You just need a reference to the file, and a place to ask for it - you don't care where it actually resides.

Whilst most of these technology shifts are all about "Business Benefit" of some sort, it's often useful to take a concept back to its absolute fundamentals to put it in perspective. Much of the surrounding definition often points to a philosophy or method - underlying all of this is often a series of technical shifts that prompted it.

Going back to the technology definition can obscure some of the more conceptual benefits, but it also gives perspective - many of the pitfalls with a new technology have already been encountered before. The SYS-CON article [2] links to Paul Wallis' definition [9], which highlights this fact further.

...And perhaps most of all, the key technologies are simply the easiest part to nail down.


[1] Fad is perhaps a bit of an unfair term - A fad could actually be a very significant event.
[2] Twenty Experts Define Cloud Computing, http://cloudcomputing.sys-con.com/read/612375_p.htm
[3] Yan Pritzker, Defining Cloud Computing, http://virtualization.sys-con.com/read/595685.htm
[4] J. Hoskins, B Woolf, IBM on Demand Technology Made Simple, 3rd Ed, Clear Horizon
[5] IBM's On Demand Strategy is hard to pin down specifically, it is a horizontal branding applied across a lot of the IBM silos. A search on the ibm.com site returns over 100,000 hits alone. Also see Google Search: http://www.google.com.au/search?q=ibm+%22on+demand%22
[6] M. Darcy, IBM Opens Africa's First "Cloud Computing" Center, Second Cloud Center in China, http://www-03.ibm.com/press/us/en/pressrelease/24508.wss
[7] To an extent this is absolutely true, but it isn't really true in practice. Standards under most SOA definitions are key, and Web Services are absolutely the cornerstone of this.
[8] However, just because your infrastructure scales, this doesn't mean your specific solution on that infrastructure does.
[9] P. Wallis, "A Brief History of Cloud Computing: Is the Cloud There Yet?", http://cloudcomputing.sys-con.com/read/581838.htm.

Sunday, July 13, 2008

Map Reduce and Parallelism

Map-Reduce is a powerful mechanism for parallel execution. Originally a set of semantics used in functional languages, map-reduce is now heavily employed in processing information in a variety of clustered environments. It's used by Google a lot, and Google has their own framework for handling map-reduce [1].

As would be evident, map-reduce is made up of two key concepts:

1. Map takes a list of data and applies an operation or transformation to each element of the list to produce another list of results. The general implication here is the results list is of the same magnitude as the source list. For example if you had a list of 1,000 numbers, after the map you'd have another list of 1,000 elements (be they numbers or not).
See: http://en.wikipedia.org/wiki/Map_(higher-order_function)

2. Reduce takes the list of results and compiles them in some fashion. Unlike Map, there is some kind of expectation of "reduction" or derivation of the data - that is if you had a list of 1,000, the result might be a list of 100, or a single number [2]. So a reduce could be anything from summing all the elements, to sorting them, or cutting them back to the top 100 - or any variant therein.
See: http://en.wikipedia.org/wiki/Fold_(higher-order_function)

Google list a number of advantages to the use of their map-reduce framework [1]:
- Automatic parallelisation and distribution
- Fault-tolerance
- I/O scheduling
- Status and monitoring

Most of these are operational benefits. However, the real benefit to Google in using map-reduce is within the first statement, the automatic parallelisation and distribution of processing. This is particularly important when processing very large data sets and/or responding to a user request - a user will only wait so long for a response. A user isn't going to wait ten minutes for a web search to return. So Google run the same search spread across thousands of machines, giving you a response in seconds (in reality, sub-second).

A fairly accessible example of a map-reduce operation is this kind of search. In this case, Map would take a function, such as "Score" and apply it to a large list of data - a large list of webpages. This score is representative of how well the webpage matches some criteria, such as search terms.

Reduce takes one or more list of scores and performs a "fold" or reduce. In this example, it would do something like take the list of scores and cut it back to the top 100, sorted from highest to lowest. This example of a reduce always produces 100 or less scores - give it 50 and it will produce 50 results (sorted), give it 1,000 and it will produce a list of the top 100 results (again, sorted).

For example, if I'm searching for "Jodoro", the Map looks at the 1,000 pages available and using the scoring operation it gives each page a score for the occurrence of "Jodoro" on the page. Reduce then looks at these 1,000 scores and whittles it back to the top 100. The pseudocode for this might look like:

Define MapReduce(Pages, Score, Reduce)
        Scores = Map(Pages, Score)
        Result = Reduce(Scores)
        Return Result

Define Map(Pages, Score)
        Create a new list Scores
        For each Page in Pages
                PageScore = Score(Page)
                Add PageScore to Scores
        Return Scores
This is all relatively pedestrian, but the power of Map-Reduce is that these operations can be performed in parallel. To illustrate this, here is a high level piece of pseudo-code:

Define MapReduce(Pages, Score, Reduce)
        Split Pages into 4 lists, producing FirstPages, SecondPages, ThirdPages, FourthPages

        FirstScores = Map(FirstPages, Score)
        SecondScores = Map(SecondPages, Score)
        ThirdScores = Map(ThirdPages, Score)
        FourthScores = Map(FourthPages, Score)

        FinalScores = Reduce(FirstScores, SecondScores, ThirdScores, FourthScores)

        Return FinalScores

This map-reduce operation splits the scoring process into four pieces that are individually mapped, then ranked using the reduce function. So if you had 400 pages, you would end up with four lists of scores, each 100 long (FirstScores through FourthScores). The Reduce function takes these four lists of 100 scores and produces a list of the top 100.
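The pseudocode above can be sketched in Java - the class and method names here are my own, the "pages" are plain strings, and the score is a pluggable function (for illustration, simply the page length):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.ToIntFunction;

public class MiniMapReduce {
    // Map: apply the scoring function to every page in a partition,
    // producing a score list of the same size.
    static List<Integer> map(List<String> pages, ToIntFunction<String> score) {
        List<Integer> scores = new ArrayList<>();
        for (String page : pages)
            scores.add(score.applyAsInt(page));
        return scores;
    }

    // Reduce: merge any number of score lists and keep the top n,
    // highest first - give it 50 scores and you get 50 back (sorted),
    // give it 1,000 and you get the top n.
    static List<Integer> reduceTop(int n, List<Integer>... scoreLists) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> scores : scoreLists)
            all.addAll(scores);
        Collections.sort(all, Collections.reverseOrder());
        return all.subList(0, Math.min(n, all.size()));
    }

    public static void main(String[] args) {
        // Two partitions of hypothetical pages, mapped independently,
        // then reduced once.
        List<Integer> first = map(Arrays.asList("aa", "b"), String::length);
        List<Integer> second = map(Arrays.asList("cccc", "ddd"), String::length);
        System.out.println(reduceTop(3, first, second)); // [4, 3, 2]
    }
}
```

The two map calls in main could run on different machines; only the reduce needs to see their combined output.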

It's probably worth pointing out at this stage that this is a complete search. We've not used heuristics, we've examined every page and determined the "best" result. We could have cheated. For example if a page scored very low, we might not have bothered to include it in the scored list [3]. Whilst these heuristics are useful, they are not guaranteed. However, in specific cases such as search, it's probably appropriate to discard very low results.

In the case of this example, another variant might be:

FirstScores = Map(FirstPages, Score)
SecondScores = Map(SecondPages, Score)
Scores1 = Reduce(FirstScores, SecondScores)

ThirdScores = Map(ThirdPages, Score)
FourthScores = Map(FourthPages, Score)
Scores2 = Reduce(ThirdScores, FourthScores)

FinalScores = Reduce(Scores1, Scores2)

In this case, we Reduce three times. First we take the first two results and combine them, then the second two - and finally we take the two aggregates to produce a final result.

So what does this all mean? Well, the key is that the mapping piece can occur entirely in parallel - or to be more correct, the scoring can occur in parallel. All of the four "Score" pieces above could occur independently. So you could run the four Score pieces on four different systems in order to process them faster. Google MapReduce or the open source Hadoop framework [4] take this parallelism a step further. Instead of breaking the Pages into four (or whatever) arbitrary pieces, they create a large pool of servers - each with their own set of Pages [5]. When you invoke a map-reduce, a Master server asks all of the servers to undertake the individual map on their particular data, producing the score or otherwise derived result. These scores are then "reduced" to the actual list of results [6]. So the split in the example pseudocode is actually achieved by partitioning the data across the multiple systems.

All this seems pretty straightforward and useful. However, there is a sting in the tail. What isn't made clear by this explanation is a key underlying premise - that the individual score operations are independent. If your score operation was somehow dependent or interlocked with another score operation, then clearly they cannot operate in parallel. This is almost certainly not the case when matching a set of search terms against a webpage - as exemplified by the search example. A score can occur on a webpage on one system completely independently of the scoring that is occurring on another physically distinct system.

It's probably also worth noting that the same could be said for Reduce - in our examples we showed Reduce being used either once or three times. The presumption is that the Reduce doesn't modify the underlying scores and that there are no other side-effects. If it did, the multiple uses of Reduce may also produce an unexpected result.

Joel Spolsky validly points out that an understanding of functional languages assists with understanding map-reduce [7] as map-reduce fundamentally came from semantics in languages such as Lisp. This is largely true, but the semantics of map-reduce do not necessarily imply parallel execution. Certainly the first functional implementations of map and reduce weren't intended for parallel operation at all.

So adding the right semantics into a language to give you map-reduce doesn't necessarily make the code parallel, whilst it does open up possibilities. In our search example, the structure of the data (Pages) and the independent operation of the mapping operation (Score) is absolutely fundamental to having the map-reduce operate in parallel. The use of map-reduce primarily provides a metaphor where these work together in a logical, consistent way.

If you had a piece of Java code do your scoring, then it's totally feasible that you don't know anything about the scoring implementation (as with a functional language, that's part of the point). However, you'd need to be careful that this piece of code isn't dependent on some kind of resource - it could be manipulating private variables, writing to disk, generating random numbers, or (god forbid) relying on some kind of Singleton instance. For example, your Score may initially generate a random number that is used in all the subsequent scoring operations. If you spread this across four machines you might have the unexpected side-effect that they are operating using a different random number. Whatever the technicalities, in some circumstances it may be very difficult to establish how independent an arbitrary Score implementation is. This is somewhat of an irony - as this kind of abstraction (i.e. an arbitrary score function) is key to a functional language.
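To make that concrete, here is a contrived scorer with hidden mutable state (my own illustration - the private counter stands in for singletons, random seeds, files on disk and the like). Run on one machine it gives one answer; partition the pages so each "machine" gets its own copy of the scorer, and the answer changes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HiddenState {
    // A scorer that secretly mixes a private call counter into the
    // score - so each call depends on how many calls came before it.
    static class Scorer {
        private int counter = 0;
        int score(String page) {
            counter++;
            return page.length() * counter;
        }
    }

    static List<Integer> scoreAll(List<String> pages, Scorer scorer) {
        List<Integer> scores = new ArrayList<>();
        for (String page : pages)
            scores.add(scorer.score(page));
        return scores;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("a", "bb", "ccc", "dddd");

        // One machine, one scorer instance.
        System.out.println(scoreAll(pages, new Scorer())); // [1, 4, 9, 16]

        // "Distributed": each partition gets its own scorer copy, and
        // the counters diverge from the sequential run.
        List<Integer> left = scoreAll(pages.subList(0, 2), new Scorer());
        List<Integer> right = scoreAll(pages.subList(2, 4), new Scorer());
        System.out.println(left + " " + right); // [1, 4] [3, 8]
    }
}
```

A side-effect-free scorer would give identical results under any partitioning; this one does not, which is exactly the property that makes an arbitrary Score implementation hard to distribute safely.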

As another example, something like the patented Google PageRank [8], relies on the linkages between pages in order to score their importance. What is important here is that the score of a page is derived from the importance of the pages that link to it. In this case, you'd need to be careful about how you built your Score function (to Rank the pages) as the score of an individual page is dependent on the scores of others [9].

So - it isn't the map-reduce that is operating in parallel. It's the means of definition of the source data and the transformation that is the first enabler to parallel operation. You can't necessarily take map-reduce and apply it wholesale to an existing complex financial risk analysis, for example. There are likely complex algorithmic dependencies that simply cannot be accommodated - certainly not without recasting the problem or data in a manner that suits the model. A solution to the problem might require a change in data definition, or the design of several cooperating map-reduce operations. Or simply, the problem might have so many implicit algorithmic dependencies that map-reduce isn't appropriate at all.

Fundamentally whilst map-reduce is useful at increasing the level of parallelism in your code, it doesn't intrinsically give you parallel processing. What it does give is a common metaphor for this type of processing, enabling the development of a framework - and all the benefits a framework brings. If you need to do parallel processing, you still need to focus on the composition of the problem.


[1] Jeffrey Dean and Sanjay Ghemawat, http://labs.google.com/papers/mapreduce.html, December 2004
[2] For something like a sort it could return the same number of elements. I can't think of a valid example where a Reduce would *increase* the number of elements. So reduce is probably a pretty apt term.
[3] Google would almost certainly do this with a Web search.
[4] See http://hadoop.apache.org/
[5] Reality is that they will have a lot of redundant nodes in order to scale, but this is the basic premise.
[6] In the case of a Google Web Search, the reduce is to reduce the set of results to the top 1,000 results. You never get more than 1,000: For example, http://www.google.com/search?q=the&start=990
[7] Joel Spolsky 'Can Your Programming Language Do This?', http://www.joelonsoftware.com/items/2006/08/01.html, August 2006
[8] See http://www.google.com/corporate/tech.html
[9] Clearly Google has approaches to solve this dependency, but it is useful in illustrating the point.