Tuesday, January 20, 2009

To Undo or not to Undo

Here at Jodoro we’ve been working on the release of a Collaborative Modeling Tool designed to run in the browser using Adobe Flex [1]. The basic premise of the application is to allow users to develop, customize, extend, or simply make use of industry standards and shared models.

It's conceivable that the data sets (the models) being accessed by the client could get quite large. So we decided on an online, delta-driven communications approach. The Flex application maintains a local cache in memory for each session, retrieving data on demand when it cannot be found in the local cache.

As we started to design the user interface and discuss the sorts of features that users would expect, we naturally raised the topic of whether the user should be able to undo their work. Even though it often seems to be overlooked, we decided that Undo/Redo was important for overall usability.

Our first thought was that there was a natural fit between the Undo/Redo actions and the delta-based communications. As such, I was expecting the Undo/Redo features to be relatively straightforward to implement. However, it didn't take long for all the corner cases to emerge, and it turned out to be trickier than I’d first expected. Now that we have a fully functioning framework, I felt it worthy of a blog article.

We started by discussing what our users would want, and what did and didn’t work well in other applications we use in our everyday lives. It's common for applications to only support Undo back to the last save point, or for a preset number of steps, but we decided our application should preferably support undoing all the way back to the start of the current session.
One feature not often implemented is forking the different Undo/Redo pathways as the user works. However, we decided this was much too complex for the user to track, and settled on a linear progression.

We also mooted getting rid of Save entirely – instead synchronizing on every user action, or at some logical interval. Whilst this was technically possible, we decided that keeping the existing Save metaphor would be more comfortable for users. Finally, we decided a Cancel function to return the user to their last Save Point would also be useful (although not strictly necessary).


Framework


With my requirements in hand I set about designing the framework. I decided to capture a linked list, the Delta Chain, of all of the user actions that change the state of my model. To simplify the implementation I initialize the Delta Chain with a StartPlaceHolder Delta Object that, as the name suggests, always remains at the front of the chain. The framework keeps a pointer, currentDelta, to the most recently executed Delta Object, initially pointing to the StartPlaceHolder Delta Object. This pointer is represented in all of the diagrams below as a red triangle.






Figure 1: Basic action manipulation


Figure 1 demonstrates the frame-by-frame changes in the Delta Object chain as various user actions and undo and redo commands are enacted by the user. New Delta Objects are always added to the current end of the chain, and the currentDelta pointer is always updated to point to the newly added delta. As the user undoes their actions the pointer moves to the left, towards the StartPlaceHolder in the Delta Chain. As they redo the actions the pointer moves to the right. If a user has undone one or more items when a new action is added, the framework drops the undone Delta Objects and attaches the new Delta Object to the right of the currentDelta pointer, then updates the currentDelta to reference the newly added Delta Object. My StartPlaceHolder token can never be undone, and hence can also never be dropped.

I created an IModelDelta interface to be realized by all of my Delta Objects, and added the following methods:


        function doAction(m:Model):void;
        function undoAction(m:Model):void;
        function redoAction(m:Model):void;


My framework executes the doAction method when a Delta Object is first added to the Delta Chain (and never again). undoAction and redoAction are called for every Undo and Redo of that action respectively.


Save




Figure 2: Saving actions


I then moved to the SaveDelta (Figure 2), which in our case is an action that sends model deltas in XML format to the server, based upon changes since the last save. The save itself is treated as a special case and can never be undone – the data has already been persisted to the server. If a user undoes back past a Save Point, the next user-enacted save will calculate the necessary deltas (reversals) to make the model consistent.

The SaveDelta doAction generates the combined XML by making use of another IModelDelta method:

        function generateXMLDeltas(m:Model):XMLList;

The SaveDelta doAction method traverses backwards along the chain of Delta Objects and calls generateXMLDeltas for each Delta Object until it reaches the last stable save point. One of my early realizations was that the last stable save point may not necessarily be a SaveDelta. It may obviously be the StartPlaceHolder, but it could also be a CancelDelta, which I discuss later in the article.
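That backwards traversal can be sketched as follows (TypeScript, with a string array standing in for the XMLList, and all names hypothetical):

```typescript
// Each entry in the chain is classified by how it affects a save traversal.
type Kind = "start" | "save" | "cancel" | "action";

interface ChainEntry {
  kind: Kind;
  generateXMLDeltas(): string[]; // stand-in for the article's XMLList
}

// Called by the SaveDelta doAction: walk backwards from the new SaveDelta
// (the last entry) until the last stable save point, which may be a
// SaveDelta, a CancelDelta, or the StartPlaceHolder.
function collectDeltasForSave(chain: ChainEntry[]): string[] {
  const xml: string[] = [];
  for (let i = chain.length - 2; i >= 0; i--) {
    if (chain[i].kind !== "action") break; // reached the last stable save point
    xml.unshift(...chain[i].generateXMLDeltas()); // preserve original order
  }
  return xml;
}
```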

As in Figure 1, if the user performs an Undo, and then another task, the undone Delta Object is discarded. This approach works fine for any deltas that have not yet been saved to the server, but is more difficult when the user has undone past a Save Point – these cannot simply be discarded as we need to generate reversals at the next Save. There are a number of options to handle this:
  1. The framework could generate the reversals upon discarding the objects, and update the server immediately. This simplifies the Delta Chain, but at the expense of increased server communications and inconsistency with the Save/Cancel metaphor.

  2. The reversal XML could be generated and stored in memory until the next SaveDelta doAction method is executed. This again simplifies the Delta Chain, but means the Undo/Redo framework needs to also understand the reversal data. The XML generation also becomes disjointed.

  3. The saved delta objects could be stored in a temporary list until the next SaveDelta doAction is executed. This creates an overhead of moving objects around arrays in memory, and is more complicated to build, but is actually quite a clean approach. It also takes the actions out of order, which is fine unless you intend to implement a cancel function.

  4. The objects could be left in place and marked so that the framework ignores them, except for the next SaveDelta doAction, which generates their reversal XML and drops them from the chain.




Figure 3: Undoing actions already saved


My current implementation is option 4 (Figure 3). I introduced new isDone and isDead flags, and added four new methods to our IModelDelta interface:

        function isDone():Boolean;
        function setDoneFlag(d:Boolean):void;
        function isDead():Boolean;
        function setDeadFlag(d:Boolean):void;

The framework now determines whether a Delta Object should be discarded, or just marked dead, based on whether the Delta Object has already been saved to the server. Such Delta Objects are easy to detect, as they have been undone and lie to the left of a SaveDelta in the chain. The framework also now ignores dead objects, skipping them in all Undo and Redo actions (along with all SaveDeltas). To everything except the next executed SaveDelta doAction method, a dead Delta Object is dead! Finally, I revisited the generateXMLDeltas method of each Delta Object and upgraded it to generate reversal XML deltas when the object is in a dead state.
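The discard-or-mark-dead decision can be sketched like this (TypeScript; the names and the string stand-in for the XML are my own):

```typescript
interface Delta {
  saved: boolean;      // already persisted by an earlier SaveDelta?
  dead: boolean;       // skipped by Undo/Redo; reversed at the next save
  forwardXML: string;
  reversalXML: string;
}

// Called when the user performs a new action while some deltas sit undone.
function dropOrMarkDead(undone: Delta[]): Delta[] {
  const kept: Delta[] = [];
  for (const d of undone) {
    if (d.saved) {
      d.dead = true; // already on the server: keep it, reverse it at next save
      kept.push(d);
    }
    // unsaved undone deltas are simply discarded
  }
  return kept;
}

// At the next save, a dead delta contributes its reversal XML instead.
function xmlFor(d: Delta): string {
  return d.dead ? d.reversalXML : d.forwardXML;
}
```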

It is also worth noting the difference between an item that has been undone and lies before the previous SaveDelta, and a delta that has been marked dead. The first scenario occurs when users undo actions that have already been saved, but can still redo them. The second scenario represents items that would have been discarded were it not for the fact that they have already been saved to the server – the user can no longer redo these actions, but we still need to generate a reversal.

My SaveDelta doAction currently iterates backwards through the Delta Object chain looking for:
  1. Actioned Delta Objects that haven’t yet been saved. When traversing backwards from the current SaveDelta, these deltas always appear before a previous CancelDelta or SaveDelta (or StartPlaceHolder).

  2. Undone Delta Objects that have been marked dead. Again traversing backwards from the current SaveDelta, these always appear after a previous SaveDelta. Technically the implementation could take advantage of the linear nature of the Delta Chain and stop looking for dead Delta Objects at the first one that isn't dead. My implementation of Cancel (details below) complicates this, and I'm currently traversing the entire Delta Chain. At some point this will become a performance bottleneck, and I will need to revisit it to determine more accurately where to stop traversing.


Cancel


The CancelDelta needs to bring the system back to the last save point. This could be a SaveDelta, the StartPlaceHolder, or another CancelDelta – a previous CancelDelta itself returned the system to the last save point at the time. Unlike Save, I chose to implement CancelDelta as an action that can be undone and redone. This adds complexity, but we felt it added a justifiable amount of usability.

There are two types of actions that need to be handled for a cancel:
  1. Any actions that have been executed since the last save need to be undone.

  2. Any actions that have been marked dead since the last save need to be redone and marked alive. (This scenario only occurs if the user has undone items that were executed before the last save.)




Figure 4: Cancelling unsaved actions


When all of the actions in the scope of the CancelDelta have occurred since the last stable save point (Figure 4), the doAction of the CancelDelta simply needs to undo these Delta Objects. The CancelDelta undoAction and redoAction should then respectively redo and undo these very same Delta Objects.





Figure 5: Cancelling undoes of saved actions


However, if Delta Objects before the SaveDelta have been marked dead (Figure 5), the CancelDelta doAction also needs to resurrect them and call their redoAction methods. The CancelDelta undoAction also needs to call each Delta Object’s undoAction, and reinstate each as dead. The CancelDelta redoAction finally needs to reverse the steps again.



Figure 6: A complex cancel scenario


There will also be scenarios where the same CancelDelta needs to manage both the undoing of actioned-but-unsaved Delta Objects and the resurrection of dead Delta Objects (Figure 6). Luckily, because new Delta Objects are always added to the end of the Delta Chain, and any undone items not yet saved are dropped, the undone dead Delta Objects that need to be redone will always appear in the Delta Chain immediately before the actioned Delta Objects that need to be undone (ignoring SaveDeltas). As such, in my implementation of the CancelDelta doAction method I keep two counters: (1) the number of unsaved deltas I reverse, and (2) the number of dead deltas I resurrect. My CancelDelta undoAction and redoAction methods use these two counters to ensure they unwind and rewind the deltas in the correct order.
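The two-counter approach can be sketched as follows (TypeScript; a plain number stands in for the model, and the names are my own hypothetical ones):

```typescript
interface Step {
  undo(m: number): number;
  redo(m: number): number;
}

class CancelDelta {
  private reversedCount = 0;    // (1) unsaved actioned deltas undone by the cancel
  private resurrectedCount = 0; // (2) dead deltas redone (resurrected) by the cancel

  // deadRun and unsavedRun are the two contiguous runs in the Delta Chain:
  // dead deltas always sit immediately before the unsaved actioned deltas.
  doAction(m: number, deadRun: Step[], unsavedRun: Step[]): number {
    for (const s of unsavedRun) { m = s.undo(m); this.reversedCount++; }
    for (const s of deadRun)    { m = s.redo(m); this.resurrectedCount++; }
    return m; // back at the last save point
  }

  undoAction(m: number, deadRun: Step[], unsavedRun: Step[]): number {
    // Reverse the cancel: kill the resurrected deltas again, redo the unsaved.
    for (const s of deadRun.slice(0, this.resurrectedCount).reverse()) m = s.undo(m);
    for (const s of unsavedRun.slice(0, this.reversedCount))           m = s.redo(m);
    return m;
  }
}
```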

Summary


With all of this in place I was finally ready to start building the real user actions! But you’ll be pleased to know that this became the easy part. I just needed to ensure that each new Delta Object’s doAction, undoAction and redoAction methods left the local cached model in the correct respective states, and that the generateXMLDeltas method implemented the correct XML deltas and reversals based on the current state of that Delta Object. The framework I’d built handled the rest for me. Before long our whole application had seamless Undo/Redo functionality and I was very pleased I’d put the effort in up front to get it right.

doug@jodoro.com

[1] We chose to build the front end in Adobe Flex primarily for its rich set of user interface features, and for the ease of distribution to our potential customer base through the Flash plug-in.

Tuesday, November 25, 2008

Corporate OpenID

We'd like to use OpenID in the solution we're developing.

There are currently a large number of OpenID providers, including AOL, the BBC, Google and Yahoo! ... and a lot of individuals will already have accounts with one of these providers. However, we'd also like to offer the same experience to corporate users.

The solution to this seems pretty obvious - corporates should become OpenID providers.

It should also be possible to do this in quite a secure fashion. When an individual uses OpenID to log in to a site, a few key things happen:

  1. The site contacts your OpenID provider, (a) works out the login location and (b) establishes a secret key for the session.

  2. The user is then redirected to the login location to enter their username and password. This is a page belonging to the provider (e.g. GMail), not the site you're actually trying to access. As such, your password is kept between you and the provider alone.

  3. If you're successful, you're sent back to the site. The key is used to verify that you've signed in with the provider.

There is clearly a bit more to it than that - you can opt in and out of various things for example - but that's the basic dialog.
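Step 2 amounts to the relying party redirecting the browser to the provider's login endpoint. As a rough illustration, the parameter names below follow the OpenID 2.0 specification, but the function, endpoint, and values are invented for the example:

```typescript
// Build the URL a relying party would redirect the user to for login.
// Parameter names are from OpenID 2.0; everything else is hypothetical.
function buildCheckidSetupUrl(
  opEndpoint: string,
  identity: string,
  returnTo: string,
  assocHandle: string
): string {
  const params = new URLSearchParams({
    "openid.ns": "http://specs.openid.net/auth/2.0",
    "openid.mode": "checkid_setup",
    "openid.claimed_id": identity,
    "openid.identity": identity,
    "openid.return_to": returnTo, // where the provider sends the user back
    "openid.realm": returnTo,
    "openid.assoc_handle": assocHandle, // references the established secret
  });
  return `${opEndpoint}?${params.toString()}`;
}
```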

There is nothing stopping corporates becoming OpenID providers for themselves. To achieve this, they would put a system in their DMZ to interact with relying parties (i.e. the sites using the OpenID).

The "sign-in" page could be established at a location internal to their network (this is not necessary and is perhaps limiting, but it would increase security). As such, when you hit an external website, you'd be redirected to an internal site to actually log in. The employee's username and password never actually leave the internal network, encrypted or otherwise. This login process would also be (more) difficult to spoof or phish, and remain quite resistant to a lot of DNS attacks.

Even better, such a solution could provide single sign-on (SSO) – using the employee's login session at their workstation, rather than requiring them to enter their username and password again.

Sun already do something similar to all of this – they provide OpenIDs to their employees via OpenID at Work. Although from what I can tell, this is a separate identity, rather than being linked to internal corporate ones.

Microsoft is a big supporter of OpenID – for example, Windows Live will support OpenID. However, I can't find any specific literature regarding a turn-key "turn your Active Directory into an OpenID provider" solution. As many corporates rely on Active Directory, this kind of solution would be a rapid enabler.

If anyone knows of solutions or initiatives looking into this, drop me a line.

...

jon@jodoro.com

Friday, September 19, 2008

Burn your CPU Cycles

Doug (pictured) and I are working on a few new technical approaches at the moment, aiming to meet some stringent functional targets we've set for ourselves [1]. This has led to some curious insights, especially around how our approach is affected by some ingrained prejudices.

Whilst I'm hardly near being tagged with the elder-statesman euphemism, I must admit that I've got an antiquated aversion to certain types of waste – wasted CPU cycles and wasted memory [2]. This is perhaps a side-effect of coming into technology via embedded programming. Either way, even after years of working in and around resource-hungry Java environments, I've never shaken the feeling that such waste is somehow lazy or wrong.

Paul Graham makes reference to good and bad waste in his essay, The Hundred-Year Language. He argues about the beneficial tradeoff of giving up some performance for elegance. We certainly subscribe to this. However, it's sometimes hard to narrow down the good from the bad [3]. For me, a good example is batch versus online (realtime) processes. If you've ever worked in technology at a large organisation, you've probably seen your share of batch processing.

Now, there are good reasons for batch processes. However, in my experience, it's more often a form of institutional laziness. Online integration is complex, harder to test and usually much more expensive to develop (upfront cost). A simple batch system is really simple. On the other hand, batch systems can become operational nightmares (ongoing cost). Compounding this, once you get beyond a certain threshold with batch processes the complexity goes through the roof [4, 5]. You can end up processing a lot of data that hasn't changed and simply didn't need processing. Even so, organisations still plough on with batch systems.

However, there is another key to all of this that is sometimes overlooked - your batch process is probably leading to a poor end user experience.

If you've ever requested something and been told "Sorry, it won't be processed until the next business day", then there is probably a batch system somewhere to blame. At one organisation I was at, you could change your address online (great!), but it took three days for this to actually trickle down to all the necessary downstream systems. You can probably guess that this regularly led to some unhappy customers.

Working towards these targets, Doug and I have come up with a few ideas that will be processing intensive. The instant instinct was always to think "no way, that'll use a whole lot of resources", usually followed by "... what a waste." However, is it really a waste? I've come to the conclusion that this is not the right way to look at the problem.

This type of mentality locks you into giving users what they have today. You're defining parameters around the user's experience that align with some (subconscious) mental model of a level of resource consumption. The fact is that the next generation of successful applications will be the ones giving users more... And to give them more, you're going to have to crank up the output.

In my view, a lot of developers are overly fixated with performance when they should have a focus on scale. Performance is important, but scale is just fundamental. If you can give your users a better experience, and your application can scale out - then it's easy to see that you're probably on a winner [6].

If you want to add that Web 2.0 tagging system to your widget, then you're going to have to use more resources. Working on a better search? Well, I think it would be a safe bet that it's going to use a lot more resources. I recall when I first saw YouTube my immediate reaction was "they're crazy – this is going to crash and burn – they're just going to waste a whole lot of money on bandwidth". My fixation with the bandwidth consumption blinded me to the fact that it was giving people something that they really wanted [7].

So we have a new rule here – the computer works for the user. If we hit an example where we're using more resources but improving the end user experience as a result, we take that as a sign that we're on the right track. We even push it where we can – we look for opportunities to use more resources, asking ourselves if it's improving the outcome for the user.

... Then the challenge for us is to make it scale.

jon@jodoro.com


[1] Partially to give ourselves a stretch target to broaden our thinking, but primarily to attain a target user experience.
[2] This reminds me of my Grandmother and her Post-WWII frugality. She couldn't tolerate waste of any kind. Much to her annoyance, I could never like Bubble and Squeak.
[3] If you've ever worked at a project-driven organisation you'll be aware of the bias to run a project cheaply, usually ignoring the higher long term cost.
[4] When I was at a large UK bank the batch used to run overnight - It was a pretty busy batch window - after everything ran overnight they only had 14 minutes to spare, so god help them if something went wrong. These same batch windows were also an impediment to doing simple things - like full 24/7 online banking, or opening branches on a Saturday.
[5] I also recall some developers who used to maintain a large batch operation. They started out using Microsoft Project to control all the windows and the dependencies (using the leveling feature). Eventually it got so complex they had to write a bespoke tool to maintain the schedules - let alone the complexity of actually monitoring and maintaining it.
[6] Naturally, there is a cost-benefit ratio coming in there somewhere. If you've ever needed to fix an application that doesn't scale, as opposed to one that is just plain resource hungry then you'll know there is a world of difference.
[7] I had a similarly incorrect first reaction to Flickr.

Thursday, September 11, 2008

Patents explained for Start Ups

A couple of weeks back Jon and I visited an Intellectual Property Lawyer to discuss the steps required to take out some patents on our technology concepts. Apart from the token IP Law education they squeezed into my degree all those many years back, I have never ventured into this side of business before. So I found this process pretty enlightening. [1]

If you are seriously considering patenting, do find a patent attorney and have a chat (yes, specifically patent attorney as opposed to intellectual property lawyer). I say this both as a disclaimer (IANAL), but much more than that, it's simply a good idea. Patent and IP Lawyers are generally experienced, clever, well-connected people and can offer good advice and referrals on a lot of topics. [2]

The first consultation is usually free (in fact let the alarm bells ring if they try to charge you). Payment generally only starts when you decide to go ahead with the process. [3]

The first thing our lawyer asked was if we were ethically opposed to patents, especially software patents. Irrespective of the answer, the advice was still to consider applying – if for no other reason than to provide protection against aggressive competitors or "patent trolls"[4]. One could, I guess, argue that this is what any lawyer is going to tell you. Even so, anyone in the industry would probably be aware that it still makes a lot of sense. He then proceeded to walk us through the different stages of registering a patent, and the different options we had – including when and roughly how much money we would need to fork over.

The basic premise of a patent is that you are disclosing your invention to the state in exchange for exclusive rights. [5] To do this you need to demonstrate that you have innovated in your chosen field beyond what would be considered a normal logical step.

Simply having your patent submission accepted doesn’t necessarily mean that you have exclusive rights to your invention. The patent can still fail, for two reasons:

  1. Someone else may have lodged the same or a similar invention before you. The patent approvers do not trawl through all of the existing patents to determine whether you are first; this is your responsibility – and that of your competitors. Google have recently implemented Google Patents, a searchable data store of all United States patents.

  2. Ultimately the courts may find the patent unenforceable. Patents are actually written as a series of claims. The first couple are typically very broad, sweeping claims that courts are unlikely to uphold; the claims then narrow in on the truly innovative parts of your patent, becoming more and more specific. This is done in the hope that the higher claims will be upheld, whilst still providing fallback protection if they are not.



Whilst the term ‘international patent’ occasionally pops up, in reality it is not possible to establish a ‘worldwide’ patent. Patents are granted on a per-territory basis. Not only that, the rules as to what can and can’t be patented, how the patent itself needs to be presented, and what exclusive rights you can derive from the patent differ from territory to territory.

Much of the fundamental basis of the patent system is similar across territories. Clearly, the broad aims of patent law in the United States and European Union are very similar. However, even when the basis of the patent law is very similar, there can often be a lot of variance in the interpretation and bias of a particular system.

In the United States, for example, patent applicants can typically secure a patent by demonstrating that their application is inventive. By contrast in the European Union, there is a stronger focus on demonstrating the patent solves a real world problem. Australia has traditionally followed the US system, but is now tending to adopt the approaches of the EU. There are other differences too. In the EU the significant date in resolving conflicts is the date of submission, while in the US it is the date of invention.

Ultimately, if you need international patent protection, and many Internet-centric technologies do, you have no choice but to individually apply for patents in each country in which you wish to secure exclusive rights to your invention. Each country has a differing cost associated with it (it varies considerably, but is circa several thousand US dollars [6]).

Much of the expense is to have a patent attorney draw up the draft patent to specifically meet that country’s requirements, and to pay for an official to prosecute the patent. However, you might incur other costs too, such as requiring official translations of your documents.

There are some options though on how you go about this, and those options have differing time periods, and differing cost structures.

You can simply apply for individual patents in the specific countries you choose. This is the cheaper option if you only wish to secure patents in a few countries. Following this path you would typically be looking at a 12-month period before the patent is prosecuted and your payments are due.

There is also an option to follow a path that uses the process defined in the Patent Cooperation Treaty (PCT). [7]
This isn't an ‘international patent’, but a normalised process for applying for patents in multiple countries. Amongst other things, it allows you to set one priority date (date of filing) across all the participating countries. [8]

This PCT process is initially more expensive (circa US$5,000 - 10,000). However, the normalised process pushes out the need for individual country patent prosecution decisions and payments by a further 18 months.

So whilst there may be a larger up-front cost, you actually buy yourself some time. If you're a start-up, this can be invaluable: it allows you to obtain a level of IP protection across all PCT countries without going through the expensive process of prosecuting the patents, and the extra time could prove critical for securing Venture Capital.

However, there are also other benefits. The international patent reviewers will highlight all of the common problems patents hit in the differing countries. This means that when the patents are sent to the individual countries to be prosecuted, it is far less likely you will see rejections, potentially saving you money on resubmissions. The international patent reviewers can also advise that you are unlikely to be able to secure patents in certain territories, saving you the cost of those submissions. The other advantage is that it provides another point in the process at which you can make modifications to your patent, which may help if the specific real-world problem you are solving has changed slightly since your original submission.

doug@jodoro.com


[1] Jon's got a little more experience, he filed for a patent back in 2000.
[2] For example, your company formation might be pretty key – especially in an IP context.
[3] Plus, if they find your idea promising, they’ll be jumping at the chance to do a new IT start-up some early favours.
[4] See http://www.paulgraham.com/softwarepatents.html
[5] For more details on the definition of a patent: http://en.wikipedia.org/wiki/Patent
[6] Japan is apparently one of the most expensive states in which to secure a patent, primarily due to the increased costs of compulsory official translation.
[7] http://en.wikipedia.org/wiki/Patent_Cooperation_Treaty
[8] The vast majority of large economies are part of the PCT ( http://www.bpmlegal.com/pctco.html). There is a list of non-members here ( http://www.bpmlegal.com/notpct.html).

Sunday, September 7, 2008

Genealogy the Ontological Problem

Genealogy is a contrast. It is a very personal experience and endeavour: the majority of people are interested in their own family, their own ties and their own history. Ultimately, however, this history is shared. The further back individuals investigate their ancestry, the broader the common ground becomes.

We are also interested in genealogy – because it's both a fascinating space, and one that exercises a lot of interesting problems.

There have been a number of sites dedicated to genealogy for some time, but the majority of these are forums in which people collaborate (e.g. "Anyone know of a John Browne in Manchester, England, sometime around 1855?"). There are also emerging sites that let you build family trees, but these are generally private trees, or limited to collaboration in family groups. TechCrunch has reported that a "war" is developing in the genealogy space [1].

Speaking very generally, a family tree is a nodal representation of relationships between individuals. The emphasis, naturally, is on genetic ties. The key relationships are marriage (or, more correctly, partnership) and parentage. These ties link the individuals in the family tree.

This is relatively simple so far. However, there are more complex relationships, even with this simple base set. For example, you can infer "great" relationships (ancestry); as you add each "great", the count increases exponentially. There are sibling relationships and other more specialised scenarios – such as half-siblings, step-siblings, twins, or adoption. In modern times you now have the possibility of same-sex marriage, surrogate pregnancy, or sperm and egg donation. There are also other cases, which can sometimes be skeletons in a family tree – multiple marriages, incest, adultery. You wouldn't need to go far back in many family histories to find someone who vanished (and perhaps had another family unknown to both sides), was disowned, or simply just forgotten.
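As a hypothetical sketch (TypeScript, with names of my own invention), the base model is individuals as nodes and typed relationship edges, with "great" relationships derived by composing parent edges:

```typescript
type RelationshipType = "partner" | "parent-of" | "adoptive-parent-of";

interface Individual {
  id: string;
  name: string;
}

// A typed edge between two individuals, identified by their ids.
interface Relationship {
  type: RelationshipType;
  from: string; // e.g. the parent, for "parent-of"
  to: string;   // e.g. the child
}

// Derive grandparents by composing two parent-of edges.
function grandparentsOf(id: string, rels: Relationship[]): string[] {
  const parentsOf = (x: string) =>
    rels.filter(r => r.type === "parent-of" && r.to === x).map(r => r.from);
  return parentsOf(id).flatMap(parentsOf);
}
```

Each extra "great" is one more composition of the parent edge, which is where the exponential growth in inferred relationships comes from.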

These can all be accommodated in most traditional data models. However, the real complexity is that family trees are still personal and can disagree. This may be as simple as a disagreement over some base factual information such as a name (e.g. Doug vs Douglas, or Smith vs Smithe). It is considerably more complex when there are structural differences, such as disagreements over parentage, or an entire lineage.

This is hard to handle using traditional data models. A lot of approaches take the tack of a "conflict resolution" mode - much like source control. However, this is inadequate. The fact is, a lot of these conflicts will never be resolved. Someone's Aunt may never agree with such-and-such's Uncle. You can simply replicate all the information in each family tree, but you're creating a lot of redundant data and (severely) limiting the utility of this information. This approach simply devalues the power of the information when people do agree.

To combine this information in a single repository requires a functional and data model that is exceptionally flexible. It's somewhat clear that it's approaching what is often called the "ontology problem" [2]. Ontologies and taxonomies are key to many (all) information domains, and absolutely fundamental to modern information systems.

If you are managing any kind of knowledge, getting the ontology right is pretty important. If you've ever tried to classify something or place it into a hierarchy, it's likely you've hit this complication. Object-Oriented development certainly falls into this space. For example, I have a Van here, and it's classified as a Vehicle, but what happens when I have a Motorhome? Or an Amphibious Vehicle? Or an Amphibious Motorhome? If I'm working in a bookstore, do I put a book under fantasy, crime, or literature? It might fit in all three.

In these cases there is no single correct answer. You end up with multiple classifications, all of which are correct - just as in genealogy, it depends on the context. The problem with ontologies is that they can be extremely difficult to define, and like family trees, they are complex, recursive, inter-dependent beasts.
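The Motorhome example maps directly onto a familiar object-oriented dilemma: class inheritance in Java forces you to pick a single parent classification, whereas interfaces let a type carry several classifications at once. A minimal sketch (the type names here are ours, purely illustrative):

```java
// Single inheritance forces exactly one parent classification.
// Interfaces let a type carry several classifications at once.
interface Vehicle { }
interface Dwelling { }
interface Watercraft { }

// A Motorhome is both a Vehicle and a Dwelling.
class Motorhome implements Vehicle, Dwelling { }

// An Amphibious Motorhome carries all three classifications at once.
class AmphibiousMotorhome implements Vehicle, Dwelling, Watercraft { }
```

Even this only works because the language designers deliberately side-stepped the single-hierarchy problem - the "correct" place for AmphibiousMotorhome in a strict tree simply doesn't exist.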

When you look at the substance of the Semantic Web, ontologies and taxonomies are absolutely key. You can't semantically link things together unless both ends agree on this key ontological information [3]. It would be impossible to search for "Motorhomes" on the Semantic Web if different sources classified this information in completely different ways. A classification might work in some contexts, but not others. You might end up with islands of information that aren't compatible and cannot be interconnected - the exact opposite of what the Semantic Web is trying to achieve.

This is why we see genealogy as a generalisable problem. Crack some problems in the genealogy space and you might be solving some fundamentals for the Semantic Web - and vice versa.

jon@jodoro.com and doug@jodoro.com


[1] See http://www.techcrunch.com/2008/09/03/genis-quest-toward-one-world-family-tree/ and http://www.techcrunch.com/2008/09/06/family-tree-wars-continue-myheritage-raises-big-round-shows-impressive-growth/.
[2] It might be worthwhile looking at the Google search for this term, http://www.google.com/search?q=ontology+problem.
[3] See http://novaspivack.typepad.com/nova_spivacks_weblog/2004/11/the_ontology_pr.html

Thursday, September 4, 2008

Working towards a Vertical Language

As part of our concept work here at Jodoro, we are working on a computing language, which we internally call Teme.

Anecdotally, there appear to be a lot of languages. The memory is probably a bit sketchy by now, but I recall from my University lecture days that there were circa 100 viable High Level Languages in the late 1970s. Today the number would be in the 1,000s, depending on how you'd classify a viable language. In my own work I'm increasingly aware (and somewhat overwhelmed) of the number of language and technology options that are available [1].

Personally, I think the increase in computing languages in the recent era boils down to a few factors:

  1. The (ever) increasing availability of processing power makes dynamic languages more accessible, and is broadening interest and activity in this domain.

  2. Environments such as Eclipse make it very easy to create basic tooling for a new language. Inadequate tooling is a common impediment to a language gaining adoption. I could also point at other key examples, such as Apache Ant or the LLVM project.

  3. I'm sure some pundits will disagree, but I believe the dominance of languages such as Java and C# in corporate environments has allowed niche areas to develop.

  4. The Open Source movement has greatly enriched the number of libraries and the amount of useful code that is available. This can give languages the "shoulders of giants" kickstart they need. There are innumerable examples, but LLVM is again a great one.

  5. We do more computing nowadays - and the computing we do is more novel.


... I think that some observers can underestimate how significant the last point is. As the computing world evolves, languages are changing in their focus and capability. A computing language is, first and foremost, a means for a human to tell a computer what to do. Most significantly, what we are asking computers to do is broadening day by day.

However, all of this begs the question: what exactly is a language? [2] Any significant software solution will eventually result in what I'd describe as some form or subset of a language. A large SAP or Siebel installation may nominally be running on a Java platform, but you'll need to understand the semantics of SAP or Siebel to actually develop anything. It's possible to argue that many developers are developing Domain Specific Languages in their solutions without really realising it [3].

Equally, you may look at Microsoft .NET as a set of languages - however, these languages all share the same underlying class library. Arguably, the "language" of the class library is more significant than the specifics of C#, VB, or another .NET language.

This (perceived) mayhem of different languages is compounded by a degree of flux in a number of emerging technologies. For example, a lot of institutions are investing in Business Rule Engines [4]. There is a variety of these engines available, each with a different area of specialisation and its own inherent language and tooling. There are also other technologies emerging rapidly in corporate environments - Business Process Execution is another classic example.

With that in mind, you could consider the Google search box a language. I often type "1 USD in AUD" into Google, or "Time in London" (there are dozens more, such as calculations or stock prices). It's a long way from a formal grammar, but it's a new language construct that we're all using every day. The nice thing about Google is that they have obviously mined and researched these kinds of queries and catered for the common examples. It's a language that is evolving in response to the actions of users.

So why are we developing another one? To accommodate the novel things we want our system to do (novel in our minds at the very least). As Paul Graham points out in his Arc Language FAQ, it would be surprising if we didn't still need to design more languages [5], [6]. If you need to work in a domain that is in any way novel, there is a good chance you need to start thinking about the language constructs required.

There are a number of specific reasons why someone might start developing a language in earnest. For example, there is a lot of buzz around the Semantic Web at the moment, and a lot of the effort in this area has focused on the SPARQL Query Language. The development of this language, and other standards, is absolutely fundamental. For us, there were a few drivers in starting to look at our own language:

  • As an intellectual exercise.

  • To explore a model of computing that is highly parallel and distributed.

  • To address a particular problem domain that is unapproachable or clumsy in other languages.


In particular, we are interested in increasing the parallelism and distribution of our systems. This is key to constructing a system that can both (1) process large data sets in "user time" and (2) scale to meet any increase in user demand. My previous article on Map-Reduce discusses one language construct for parallelism - we're keen to take this further and into other processing domains.

Developing a language is a useful exercise. It lets you put a lens on your work and view it in a different way. If you take your problem and look at it as a language problem, you see the semantics in a new light. Even if you're working in "another language" such as Java or Ruby, it's still useful to think about the constructs you are building as a language and work towards that. This is a key philosophy of languages such as Lisp [7], but it's a fundamentally useful exercise regardless of your target language.

How to go about this is the next question. It's very easy to set up the key axioms for a new language or approach, but the difficulty arises as you inevitably encounter trade-offs along the way. In our effort we're continually walking a line between having something ultimately flexible and having something that remains interoperable.

In later articles I'll write about how we decided to go about this (for better or worse) and some of the motivations and experiences that drove our efforts. I'll also discuss what we mean by a "Vertical Language" and how that is shaping our end goals.

jon@jodoro.com


[1] Eric Lebherz, HyperNews Computer Language List, http://www.hypernews.org/HyperNews/get/computing/lang-list.html, 2005
[2] Trying to define what a language is is perhaps the ultimate tautology, but hopefully you get my point. Wikipedia has an article on Computer Languages that is less philosophical.
[3] This is perhaps expressed more succinctly by Greenspun's Tenth Rule
[4] I've personally come across three Business Rule Engines in the same organisation - ILOG JRules, Fair Isaac Blaze and Experian. All very different rules engines, with their own language definitions.
[5] Paul Graham, Arc Language FAQ, http://www.paulgraham.com/arcfaq.html, Year Unspecified
[6] Paul Graham also has a great essay called the The Hundred-Year Language that is well worth reading.
[7] Lisp pundits call Lisp a "Meta-Language" for this very reason.

Wednesday, September 3, 2008

Looping in distributed programming



for (init_expression; loop_condition; loop_expression)
{
        program statement;
}


while (loop_condition)
{
        program statement;
        loop_expression;
}


The simple ‘for’ and ‘while’ loops above have permeated so many programming languages that it's hard to imagine a programmer who hasn't written them innumerable times - so much so that most programmers would not bother to question their function; they are, after all, intuitive and self-explanatory. [1]

However, as we started to look at parallel distributed execution, we realised the dynamics of these constructs are pivotal. This essay will dare to delve deeper into the simple loop construct and question how to make it perform. In particular, I will focus on loop parallelisation. [2]

Historically the vast majority of loop constructs have been written to execute in a single processor, and that is exactly how the creators of the original construct intended us to use it. However, over the last decade we have seen the emergence of some large data sets, not the least of these being the Internet. This provides many classes of problems that require us to loop over such large data sets that a single processor, no matter how fast, is simply not going to return a result within tolerable real-time constraints.
Simultaneously, processor vendors are moving to an increasingly “multicore” strategy. Intel recently told developers to prepare for “thousands of cores”.[3]

There has historically been a happy marriage between the way humans think about problems and the way computers execute them – both, even if multitasking, essentially boil down to linear execution. As such the programmer writing an algorithm typically thinks and writes one step after another. But when it comes to distributed computing, suddenly we are asking computers to work in ways many people cannot – at least not alone. Programming for distributed computing expects the single programmer to derive and (somehow) write an algorithm as if an undefined number of machines would be collaboratively executing it.

With this in mind, it becomes abundantly clear that our timeless loop construct is inadequate for distributed computing. We either need a construct that informs the computer network what can be distributed, or we need an interpreter that can determine at run time what can and can't be distributed. Of course simply because something can be distributed doesn't necessarily mean it should – for example if the data set is small the distribution could be all overhead. However, there is no doubt there are many problems for which this isn’t the case.

Let us first look at the different types of loops to help us understand which can and can't be distributed.

Independent Loops


The first I like to call the Independent Loop. Such loops execute the same logic/function over an array of data - for example, 'scoring' the relevance of a set of data that has been collected. The key to this type of looping is that the only data being modified is the data being looped over (or the corresponding element in an identically sized 'result' dataset). An example of this type of looping is:

int data[] = {5, 4, 1, 6, 3};
for (int i = 0; i < data.length; i++)
{
        data[i] *= data[i];
}


The advantage of Independent Looping is that it can be maximally distributed. We could comfortably split the data into any number of arbitrary parts and execute the same multiplication 'for' loop across each part. Concatenating the resulting data sets would leave us with a result identical to executing the single algorithm over the whole data set.

This type of looping is essentially a form, or superset, of Map-Reduce as discussed in Jon's article "Map-Reduce and Parallelism", posted in July 2008.
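The "split, execute, concatenate" idea can be sketched in Java as follows. The method names are ours, and the two parts run sequentially here purely for illustration - in a real deployment each part could be shipped to a different core or machine:

```java
import java.util.Arrays;

class IndependentLoop {
    // The loop body from the example above: squares each element,
    // touching only its own slice of the data.
    static int[] square(int[] part) {
        int[] out = new int[part.length];
        for (int i = 0; i < part.length; i++) {
            out[i] = part[i] * part[i];
        }
        return out;
    }

    // Split the data, run the same loop over each part independently,
    // then concatenate the results in the original order.
    static int[] squareDistributed(int[] data, int splitAt) {
        int[] left = square(Arrays.copyOfRange(data, 0, splitAt));
        int[] right = square(Arrays.copyOfRange(data, splitAt, data.length));
        int[] result = new int[data.length];
        System.arraycopy(left, 0, result, 0, left.length);
        System.arraycopy(right, 0, result, left.length, right.length);
        return result;
    }
}
```

Because no iteration reads or writes anything outside its own element, the split point is arbitrary - any partitioning yields the same concatenated result.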

Accumulative Loops


The next loop type I call the Accumulative Loop. An Accumulative Loop is similar to the Independent Loop in that the function is executed on a single dataset one element at a time. However, rather than (or as well as) modifying the dataset element, the algorithm modifies at least one variable outside the loop. What is key in this type of loop is that the sequence of modifications to the external variable isn't important. This is an important distinction: if the outcome depends on the order of execution, then the algorithm is not deterministic in a distributed environment.
An example of this kind of loop is summing the values of a dataset:

int data[] = {5, 4, 1, 6, 3};
long sum = 0;
for (int i = 0; i < data.length; i++)
{
        sum += data[i];
}


As with the Independent Loop, this kind of loop can be distributed. Care obviously needs to be taken to ensure the various versions of the externally updated variables are ultimately reunited.
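Reuniting the external variable can be sketched like this (again, method names are ours and the parts run sequentially for illustration). Each worker accumulates into its own copy of the external variable, and the copies are combined at the end - valid here because addition is commutative and associative, so the order of combination doesn't matter:

```java
class AccumulativeLoop {
    // Each worker runs the same loop, but accumulates into its own
    // local copy of the external 'sum' variable, over its own slice.
    static long sumPart(int[] data, int from, int to) {
        long sum = 0;
        for (int i = from; i < to; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Reunite the partial sums. Because + is order-independent,
    // the result matches the single-processor loop exactly.
    static long sumDistributed(int[] data, int splitAt) {
        return sumPart(data, 0, splitAt) + sumPart(data, splitAt, data.length);
    }
}
```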

Dependent Loops


The last type of loop I'd like to introduce is the Dependent Loop. This is one where the result of one iteration depends on the last, or the outcome of the loop depends on the order of execution. The dependency may be data or control related. A data-dependent example is:

int data[] = {5, 4, 1, 6, 3};
for (int i = 1; i < data.length; i++)
{
        data[i] += data[i-1];
}


One such control-related dependency may even be the decision whether to execute the next iteration at all. This is common in while loops - for example, the conventional algorithm for deriving the greatest common divisor of two numbers u and v:

while (v != 0)
{
        temp = u % v;
        u = v;
        v = temp;
}


When the result of the next iteration of the loop relies upon the former, we cannot distribute the execution. Instead, where possible, the code should be rewritten to be Independent or Accumulative.
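As a sketch of such a rewrite (our own example, not from any particular library): a search loop that exits on the first match is control-Dependent, because whether an iteration runs hinges on what earlier iterations found. Testing every element and combining the results with a logical OR makes it Accumulative, since OR is order-independent - at the cost of always touching every element:

```java
class SearchRewrite {
    // Dependent form: the early exit couples each iteration's control
    // flow to the outcome of the iterations before it.
    static boolean containsDependent(int[] data, int target) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == target) {
                return true;
            }
        }
        return false;
    }

    // Accumulative form: every element is tested regardless, and the
    // partial results are combined with OR, in any order. Slices of
    // the loop could now run on separate processors and be reunited.
    static boolean containsAccumulative(int[] data, int target) {
        boolean found = false;
        for (int i = 0; i < data.length; i++) {
            found = found | (data[i] == target);
        }
        return found;
    }
}
```

The trade-off is typical of these rewrites: the Accumulative form does more total work, but that work can be spread across many processors.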
An interesting observation is that the boundary case of infinite looping is a class of Dependent Loop, because the Independent and Accumulative Loops iterate over a predefined dataset. The infinite loop is a particularly dangerous boundary case in distributed programming: if released undetected it could literally consume all available resources. Most applications will have an infinite loop somewhere, generally as a message-processing construct (equivalent to the while(true) { } in your Java Runnable) - but these are implicitly dependent, because generally these loops are consuming some kind of resource (i.e. a message queue or similar).

Of interest to me was that all three types of algorithm can be written using 'for' loops (or even 'while' loops), and hence the construct is actually too flexible to be useful in distributed programming. If, for example, we have:

for (...) { blah.doSomething() }

it is very difficult to determine whether this loop is Dependent, Accumulative or Independent. This may be even more difficult in a dynamic functional language such as Ruby or Lisp, where you might be using function pointers that are unknown at compile time. [4]

Instead, in our development we have introduced new constructs in our distributed language that allow the programmer and the computer to cleanly separate these types. The programmer must specify their intention at design time, which significantly simplifies the compiler's task of determining which loops will be distributed and which will not.

doug@jodoro.com


[1] The Do loop was first introduced in FORTRAN 66 in 1966. See: http://en.wikipedia.org/wiki/For_loop#Timeline_of_for_loop_in_various_programming_languages
[2] For details on other loop optimisation techniques take a look at http://en.wikipedia.org/wiki/Loop_transformation.
[3] See http://news.cnet.com/8301-13924_3-9981760-64.html
[4] This same issue is referenced in Jon’s Map-Reduce and Parallelism article.