Friday, September 19, 2008

Burn your CPU Cycles

Doug (pictured) and I are working on a few new technical approaches at the moment, aiming to meet some stringent functional targets we've set for ourselves [1]. This has led to some curious insights, especially around how our approach is affected by some ingrained prejudices.

Whilst I'm hardly near being tagged with elder-statesmen euphemism, I must admit that I've got an antiquated aversion to certain types of waste - wasted CPU cycles and wasted memory [2]. This is perhaps a side-effect of coming into technology via embedded programming. Either way, even after years of working in and around resource-hungry Java environments, I've never shaken the feeling that it's somehow lazy or wrong.

Paul Graham makes reference to good and bad waste in his essay, The Hundred-Year Language. He argues about the beneficial tradeoff of giving up some performance for elegance. We certainly subscribe to this. However, it's sometimes hard to narrow down the good from the bad [3]. For me, a good example is batch versus online (realtime) processes. If you've ever worked in technology at a large organisation, you've probably seen your share of batch processing.

Now, there are good reasons for batch processes. However, in my experience, it's more often a form of institutional laziness. Online integration is complex, harder to test and usually much more expensive to develop (upfront cost). A simple batch system is really simple. On the other hand, batch systems can become operational nightmares (ongoing cost). Compounding this, once you get beyond a certain threshold with batch processes the complexity goes through the roof [4, 5]. You can end up processing a lot of data that hasn't changed and simply didn't need processing. Even so, organisations still plough on with batch systems.

However, there is another key to all of this that is sometimes overlooked - your batch process is probably leading to a poor end user experience.

If you've ever requested something and been told "Sorry, it won't be processed until the next business day", then there is probably a batch system somewhere to blame. At one organisation I was at, you could change you address online (great!), but it took three days for this to actually be trickle down to all the necessary downstream systems. You can probably guess that this regularly led to some unhappy customers.

When Doug and I are working on these targets, we have come up with few ideas that will be processing intensive. The instant instinct was always to think "no way, that'll use a whole lot of resources". Usually followed by "... what a waste.". However, is it really a waste? I've come to the conclusion that this is not the right way to look at the problem.

This type of mentality locks you into giving users what they have today. You're defining parameters around the user's experience that align with some (subconscious) mental model of a level of resource consumption. The fact is that the next generation of successful applications will be the ones giving users more... And to give them more, you're going to have to crank up the output.

In my view, a lot of developers are overly fixated with performance when they should have a focus on scale. Performance is important, but scale is just fundamental. If you can give your users a better experience, and your application can scale out - then it's easy to see that you're probably on a winner [6].

If you want to add that Web 2.0 tagging system to your widget, then you're going to have to user more resources. Working on a better search? Well, I think it would be a certain bet that it's going to be using a lot more resource. I recall when I first saw YouTube my immediate reaction was "they're crazy - this is going to crash and burn - they're just going to waste a whole lot of money on bandwidth". My fixation with the bandwidth consumption blinded me to the fact that it was giving people something that they really wanted [7].

So we have a new rule here - the computer works for the user. If we hit an example where we're using more resources, but improving the end user experience as a result - we take that as a sign that we're on the right track. We even push it where we can - We look for opportunities to use more resources, asking ourselves if it's improving the outcome for the user.

... Then the challenge for us is to make it scale.

[1] Partially to give ourselves a stretch target to broaden our thinking, but primarily to attain a target user experience.
[2] This reminds me of my Grandmother and her Post-WWII frugality. She couldn't tolerate waste of any kind. Much to her annoyance, I could never like Bubble and Squeak.
[3] If you've ever worked at a project-driven organisation you'll be aware of the bias to run a project cheaply, usually ignoring the higher long term cost.
[4] When I was at a large UK bank the batch used to run overnight - It was a pretty busy batch window - after everything ran overnight they only had 14 minutes to spare, so god help them if something went wrong. These same batch windows were also an impediment to doing simple things - like full 24/7 online banking, or opening branches on a Saturday.
[5] I also recall some developers who used to maintain a large batch operation. They started out using Microsoft Project to control all the windows and the dependencies (using the leveling feature). Eventually it got so complex they had to write a bespoke tool to maintain the schedules - let alone the complexity of actually monitoring and maintaining it.
[6] Naturally, there is a cost-benefit ratio coming in there somewhere. If you've ever needed to fix an application that doesn't scale, as opposed to one that is just plain resource hungry then you'll know there is a world of difference.
[7] I had a similarly incorrect first reaction to Flickr.

Thursday, September 11, 2008

Patents explained for Start Ups

A couple of weeks back Jon and I visited an Intellectual Property Lawyer to discuss the steps required to take out some patents on our technology concepts. Apart from the token IP Law education they squeezed into my degree all those many years back, I have never ventured into this side of business before. So I found this process pretty enlightening. [1]

If you are seriously considering patenting, do find a patent attorney and have a chat (yes, specifically patent attorney as opposed to intellectual property lawyer). I say this both as a disclaimer (IANAL), but much more than that, it's simply a good idea. Patent and IP Lawyers are generally experienced, clever, well-connected people and can offer good advice and referrals on a lot of topics. [2]

The first consultation is usually free (in fact let the alarm bells ring if they try to charge you). Payment generally only starts when you decide to go ahead with the process. [3]

The first thing our lawyer asked was if we were ethically opposed to patents, especially software patents. Irrespective of the answer, the advice was still to consider applying – if for no other reason than to provide protection against aggressive competitors or "patent trolls"[4]. One could, I guess, argue that this is what any lawyer is going to tell you. Even so, anyone in the industry would probably be aware that it still makes a lot of sense. He then proceeded to walk us through the different stages of registering a patent, and the different options we had – including when and roughly how much money we would need to fork over.

The basic premise of a patent is that you are disclosing your invention to the state in exchange for exclusive rights. [5] To do this you need to demonstrate that you have innovated in your chosen field beyond what would be considered a normal logical step.

Simply having your patent submission accepted doesn’t necessarily mean that you have exclusive rights to your invention. It can still be rejected by the courts. This is due to two reasons:

  1. Someone else may have lodged the same or similar invention before you. The patent approvers do not trawl through all of the existing patents to determine if you are the first, this is your responsibility – and that of your competitors. Google have recently implemented Google Patents, a searchable data store of all United States patents.

  2. Ultimately the courts may find the patent unenforceable. Patents are actually written as a series of claims. The first couple are typically very broad sweeping claims that courts are unlikely to uphold; the claims then start to narrow in on the truly innovative parts of your patent becoming more and more specific. This is done in the hope that the higher claims will be upheld, whilst still providing fallback protection if they are not.

Whilst the term ‘international patent’ occasionally seems to pop up, in reality it is not possible to establish a ‘world wide’ patent. Patents are granted on a per territory basis. Not only that, the rules as to what can and can’t be patented, how the patent itself needs to be presented, and what exclusive rights you can derive from the patent differs from territory to territory.

Much of the fundamental basis of the patent system is similar in many territories. Clearly, the broad aims of patent law in the United States and European Union are very similar – However, even when the basis of the patent law is very similar, there can often be a lot of variance in the interpretation and bias of a particular system.

In the United States, for example, patent applicants can typically secure a patent by demonstrating that their application is inventive. By contrast in the European Union, there is a stronger focus on demonstrating the patent solves a real world problem. Australia has traditionally followed the US system, but is now tending to adopt the approaches of the EU. There are other differences too. In the EU the significant date in resolving conflicts is the date of submission, while in the US it is the date of invention.

Ultimately, if you need International Patent protection, and many Internet-centric technologies do, you have no choice but to individually apply for patents in each country that you wish to secure exclusive rights to your invention. Each country has a differing cost associated to them (It varies considerably, but circa several thousand dollars US.[6] )

Much of the expense is to have a patent attorney draw up the draft patent to specifically meet that country’s requirements, and to pay for an official to prosecute the patent. However, you might incur other costs too, such as requiring official translations of your documents.

There are some options though on how you go about this, and those options have differing time periods, and differing cost structures.

You can simply apply for individual patents in the specific countries you choose. This is the cheaper option if you only wish to secure patents in a few countries. Following this path you would typically look at with a 12 month period before the patent is prosecuted and your payments are due.

There is also an option to follow a path that uses the process defined in the Patent Cooperation Treaty. [7]
This isn't an ‘international patent’, but a normalised processed for applying for patents in multiple countries. Amongst other things, it allows you to set up on priority date (date of filing) across all the participating countries. [8]

This PCT process is initially more expensive (circa US$5,000 - 10,000). However, the normalised process pushes out the need for individual country patent prosecution decisions and payments by a further 18 months.

So whilst there may be a larger up-front cost, you actually buy yourself some time. If you're a start-up, this can be invaluable. It allows you to obtain a level of IP protection across all PCT countries without going through the expensive process of prosecuting these patents. This time could be invaluable for securing Venture Capital.

However, there are also other benefits. The international patent reviewers will highlight all of the common problems patents hit for the differing countries. This means when the patents are sent to the individual countries to be prosecuted it is far less likely you will see rejections, hence potentially saving you money on resubmissions. The international patent reviewers can also advice that you are unlikely to be able to secure patents in certain territories, saving you the cost of these submissions. The other advantage is that it provides you another point in the process in which you can make modifications to your patent, which may help if the specific real world problem you are solving has slightly changed since your original submission.

[1] Jon's got a little more experience, he filed for a patent back in 2000.
[2] For example, your company formation might be pretty key – especially in an IP context.
[3] Plus especially if they find your idea promising they’ll be jumping at the chance to provide a promising new IT start up some early favours.
[4] See
[5] For more details on the definition of a patent:
[6] Japan is apparently one of them most expensive states to secure a patent in primarily due to the increased costs of compulsory official translation.
[8] The vast majority of large economies are part of the PCT ( There is a list of non-members here (

Sunday, September 7, 2008

Genealogy the Ontological Problem

Genealogy is a contrast. It is a very personal experience and endeavour. The majority of people are interested in their own family, their own ties and their own history. Ultimately, however, this history is shared. As individuals investigate further back into their ancestry, the broader the common ground becomes.

We are also interested in Genealogy - because it's an both a fancinating space, and it's one that exercises a lot of interesting problems.

There are a number of sites dedicated to Genealogy for some time, but the majority of these are forums in which people collaborate (e.g. "Anyone know of a John Browne in Manchester, England, sometime around 1855"). There are also emerging sites that let you build family trees, but these are generally private trees, or limited to collaboration in family groups. TechCrunch reported on that a "war" is developing in the genealogy space [1].

Speaking very generally, a family tree is a nodal representation of relationships between individuals. The emphasis, naturally, is on genetic ties. The key relationships are Marriage (or more correctly, parents) and parentage. These ties link the individuals in the family tree.

This is relatively simple so far. However, there are more complex relationships, even with this base simple set of relationships. For example, you can infer "great" relationships (ancestry). As you add each "great", this increases exponentially. There are sibling relationships and other more specialised scenarios - such as half-siblings, step-sibilings, twins, or adoption. In modern times you now have the possibility of same-sex marriage, surrogate pregnancy or sperm and egg donation. There are also other cases, which could sometimes be skeletons in a family tree - Multiple marriages, incest, adultery. You wouldn't need to go far back in many family histories to find someone who vanished (and perhaps had another family unknown to both sides), was disowned or simply just forgotten.

These can all be accommodated in most traditional data models. However, the real complexity is that family trees are still personal and can disagree. This may be as simple as a disagreement over some base factual information such as a name (e.g. Doug vs Douglas, or Smith vs Smithe). It is considerably more complex when there are more structural differences, such as disagreements over a parentage, or and entire lineage.

This is hard to handle using traditional data models. A lot of approaches take the tack of a "conflict resolution" mode - much like source control. However, this is inadequate. The fact is, a lot of these conflicts will never be resolved. Someone's Aunt may never agree with such-and-such's Uncle. You can simply replicate all the information in each family tree, but you're creating a lot of redundant data and (severely) limiting the utility of this information. This approach simply devalues the power of the information when people do agree.

To combine this information using a single repository requires a functional and data model that is exceptionally flexible. It's somewhat clear that it's approaching what is often called the "ontology problem" [2]. Ontologies and the Taxonomies are key to many (all) information domains, and absolutely fundamental to modern information systems.

If you are managing any kind of knowledge, getting the ontology right is pretty important. If you've ever tried to classify or put something into a hierarchy, then it's likely you've hit this complication. Object-Orientated development certainly falls into this space. For example, I have a Van here, and it's classified as Vehicle, but what happens when I have a Motorhome? Or an Amphibious Vehicle? Or an Amphibious Motorhome? If I'm working in a bookstore, do I put a book under fantasy, crime or literature? It might fit in all three.

In these cases, there is no correct answer. You end up with multiple classifications, all of which are correct. Just like genealogy it depends on the context. The problem with ontologies is that they can be extremely difficult to define, and like Family Trees, they are complex, recursive, inter-dependent beasts.

When you look at the substance of the Semantic Web, ontologies and taxonomies are absolutely key. You can't semantically link things together unless the ends agree on this key ontological information [3]. It would be impossible to search for "Motorhomes" on the Semantic Web if different sources classify this information in a completely different way. This classification might work in some contexts, but not others. You might end up with islands of information that aren't compatible and cannot be interconnected - the exact opposite of what the Semantic Web is trying to achieve.

This is why we see genealogy as a generalisable problem. Crack some problems in the genealogy space and you might be solving some fundamentals for the Semantic Web - and vice versa. and

[1] See and
[2] It might be worthwhile looking at the Google search for this term,
[3] See

Thursday, September 4, 2008

Working towards a Vertical Language

As part of our concept work here at Jodoro, we are working on a computing language, which we internally call Teme.

Anecdotally, there appear to be a lot of languages. The memory is probably a bit sketchy by now, but I recall from my University lecture days, there were circa 100 viable High Level Languages in the late 1970s. Today this would is in 1,000s, depending on how you'd classify a viable language. Anecdotally, in my own work I'm increasingly aware (and somewhat overwhelmed) of the number of language and technology options that are available [1].

Personally, I think the increase in computing languages in the recent era boils down to a few factors:

  1. (Ever) increasing availability of processing power makes dynamic languages more accessible, and is increasing the (broader) interest and activity in this domain.

  2. Environments such as Eclipse allow make it very easy to create basic tooling for a new language. Adequate tooling is a common impediment for a language gaining adoption. I could also point at other key examples, such as Apache Ant or the LLVM project.

  3. I'm sure some pundits will disagree, but I believe dominance of languages such as Java, C# in corporate environments has allowed niche areas to develop.

  4. The Open Source Movement has greatly enriched the amount of libraries and useful code that is available. This can give languages the "shoulder of giants" kickstart they need. There are innumerable examples, but LLVM is another great example for this.

  5. We do more computing nowadays - and the computing we do is more novel.

... I think that some observers can underestimate how significant the last point is. As the computing world evolves, languages are changing in their focus and capability. A computing language is, first and foremost, a means of getting a human to tell a computer what to do. Most significantly, what we are asking computers to do is broadening day by day.

However, all of this begs the question, what is a language exactly? [2] Any significant software solution will eventually result in what I'd describe as some form or subset of a language - A large SAP or Siebel installation may nominally be running on a Java platform, but you'll need to understand the semantics of SAP or Siebel to actually develop anything. It's possible to argue that many developers will be developing Domain Specific Languages in their solutions without really realising it [3].

Equally, you may look at Microsoft .NET as a set of languages - However, at the same time, these languages all share the same underlying class library. Arguably, the "language" of the class library is a more significant than the specifics of C#, VB, or another .NET language.

This (perceived) mayhem of different languages is compounded by a degree of flux with a number of emerging technologies. For example, a lot of institutions are investing in Business Rule Engines [4]. There are a variety of these engines available, each with a different area of specialisation and their own inherent language and tooling. There also also other technologies that are emerging rapidly in corporate environments - Business Process Execution is another classic example.

With that in mind, you could consider the Google search box a language. I often use "1 USD in AUD" in Google, or "Time in London" (there are dozens more, such as calculations or stock prices). It's a way from a formal grammar, but it's a new language construct that we're all using every day. The nice thing about Google is they obviously have mined and researched these kinds of queries and catered for these common examples. It's a language that is evolving in response to the actions of users.

So why are we developing another one? To accommodate the novel things we want our system to do (novel in our minds in the very least). As Paul Graham points out as part of this Arc Language FAQ - It would be surprising if we didn't still need to design more languages. [5], [6]. If you need to work in domain that is any way novel, there is a good chance you need to start thinking about the language constructs required.

There are a number of specific reasons why someone might start developing a language in earnest. For example, there is a lot of buzz around the Semantic Web at the moment. A lot of the effort in this area has focused on the SPARQL Query Language. The development of this language, and other standards, are absolutely fundamental. For us, there were a few drivers in taking starting to look at our own language:

  • As an intellectual exercise.

  • To explore a model of computing that is highly parallel and distributed.

  • To address a particular problem domain that is unapproachable or clumsy in other languages.

In particular, we are interested in increasing the parallelism and distribution of our systems. This is key in constructing a system that can both (1) process large data sets in "user time" as well as (2) be capable of scaling to meet any increase in user demand. My previous article on Map-Reduce discusses one language construct for parallelism - we're keen to take this further and into other processing domains.

Developing in a language is a useful exercise. Developing a language lets you put a lens your work and view it in a different way. If you take your problem and look as a language problem, you see the semantics in a new light. Even if you're working in "another language" such as Java or Ruby - it's still useful to think about the constructs your are building as a language and work towards that. This is a key philosophy to languages such as Lisp [7] , but it's a still a fundamentally useful exercise regardless of your target language.

How to go about this is the next question question - It's very easy to set up the key axioms for a new language or approach, but the difficulty arises as you inevitably encounter trade-offs along the way. In our effort we're continually walking a line between having something ultimately flexible, against having something that remains interoperable.

In later articles I'll write about how we decided to go about this (for better or worse) and some of the motivations and experiences that drove our efforts. I'll also discuss what we mean by a "Vertical Language" and how that is shaping our end goals.

[1] Eric Lebherz, HyperNews Computer Language List,, 2005
[2] Trying to define what a language is semantic is perhaps the ultimate tautology, but hopefully you get my point. Wikipedia has an article on Computer Languages that is less philosophical.
[3] This is perhaps expressed more succinctly by Greenspun's Tenth Rule
[4] I've personally come across three Business Rule Engines in the same organisation - ILOG JRules, Fair Isaac Blaze and Experian. All very different rules engines, with their own language definitions.
[5] Paul Graham, Arc Language FAQ,, Year Unspecified
[6] Paul Graham also has a great essay called the The Hundred-Year Language that is well worth reading.
[7] Lisp pundits call Lisp a "Meta-Language" for this very reason.

Wednesday, September 3, 2008

Looping in distributed programming

for (init_expression; loop_condition; loop_expression)
        program statement;

while (loop_condition)
        program statement;

The simple ‘for’ and ‘while’ loops above have so permeated so many programming languages that it's hard to imagine a programmer that hasn't written them innumerable times. So much so that most programmers would not bother to question its function, it is after all intuitive and self explanatory.[1]

However, as we started to look at parallel distributed execution, we realised the dynamics of these constructs are pivotal. This essay will dare to delve deeper into the simple loop construct to question how to make it perform. In particular I will focus on loop parallelisation.[2]

Historically the vast majority of loop constructs have been written to execute in a single processor, and that is exactly how the creators of the original construct intended us to use it. However, over the last decade we have seen the emergence of some large data sets, not the least of these being the Internet. This provides many classes of problems that require us to loop over such large data sets that a single processor, no matter how fast, is simply not going to return a result within tolerable real-time constraints.
Simultaneously, processor vendors are moving to an increasingly “multicore” strategy. Intel recently told developers to prepared for “thousands of cores”.[3]

There has historically been a happy marriage between the way humans think about problems and the way computers execute them – both, even if multitasking, essentially boil down to linear execution. As such the programmer writing an algorithm typically thinks and writes one step after another. But when it comes to distributed computing, suddenly we are asking computers to work in ways many people cannot – at least not alone. Programming for distributed computing expects the single programmer to derive and (somehow) write an algorithm as if an undefined number of machines would be collaboratively executing it.

With this in mind, it becomes abundantly clear that our timeless loop construct is inadequate for distributed computing. We either need a construct that informs the computer network what can be distributed, or we need an interpreter that can determine at run time what can and can't be distributed. Of course simply because something can be distributed doesn't necessarily mean it should – for example if the data set is small the distribution could be all overhead. However, there is no doubt there are many problems for which this isn’t the case.

Let us first look at the different types of loops to help us understand which can and can't be distributed.

Independent Loops

The first I like to call the Independent Loop. Such loops will execute the same logic/function over an array of data. It might be for example 'scoring' the relevance of a set of data that has been collected. The key to this type of looping is that the only data being modified is the data being looped (or the corresponding element in and identically sized 'result' dataset). An example of this type of looping is:

int data[] = {5, 4, 1, 6, 3};
for (int i = 0; i++; i < data.length)
        data[i] *= data[i];

The advantage of Independent Looping is that it can be maximally distributed. We could comfortably split 'the data' into any number of arbitrary parts and execute the same multiplication 'for' loop across each part. Concatenating the resulting data sets would leave us with an identical result as to if we had executed the data set in the single algorithm.

This type of looping is essentially a form or superset of Map-Reduce as discussed by Jon in his article on "Map-Reduce and Parrallelism" posted in July, 2008.

Accumulative Loops

The next loop type I call the Accumulative Loop. An Accumulative Loop is similar to the Independent Loop in that the function is executed on a single dataset one element at a time. However, rather than (or as well as) modifying the dataset element, the algorithm is modifying at least one variable outside the loop. What is key in this type of loop is that the sequence of modification to the external element isn’t important. This is an important distinction as if the outcome is dependent on the order of execution, then the algorithm is not deterministic in a distributed environment.
An example of a use of this kind of loop is to sum the values of a dataset:

int data[] = {5, 4, 1, 6, 3};
long sum = 0;
for (int i = 0; i++; i < data.length)
        sum += data[i];

As with the Independent Loop this kind of loop can be distributed. Care obviously needs to be made to ensure the various versions of the externally updated variables are ultimately reunited.

Dependent Loops

The last type of loop I'd like to introduce is the Dependent Loop. This is one where the result of one iteration is dependent on the last, or the outcome of the loop is dependent on the order of execution. The dependency may be data, or control related. A data dependent example is:

int data[] = {5, 4, 1, 6, 3};
for (int i = 1; i++; i < data.length)
        data[i] += data[i-1];

One such control related dependency may even be the decision whether to execute the next iteration. This is common in while loops, for example the conventional algorithm for deriving the greatest common devisor of two numbers u and v:

while (v != 0)
        temp = u % v;
        u = v;
        v = temp;

When the result of the next iteration of the loop relies upon the former we cannot distribute the execution. Instead, if possible, the code should be rewritten to be Independent or Accumulative.
An interesting observation is that the boundary case of infinite looping is a class of Dependent Loop. This is because the Independent and Accumulative Loops iterate over a predefined dataset. The infinite loop is a particularly dangerous boundary case in distributed programming as if released undetected it could literally consume all of the resources. Most applications will have an infinite loop somewhere, generally as a message processing construct (equivalent to the while(true) { } in your Java Runnable) - but these are implictly dependent because generally these loops will be consuming some kind of resource (i.e. a message queue or similar).

Of interest to me was that all three types of algorithms can be written using 'for' loops (or even 'while' loops), and hence the construct is actually too flexible to be useful in distributed programming. If for example we have:

for (...) { blah.doSomething() }

It is very difficult to determine whether this loop is Dependent, Accumulative or Independent. This may be even more difficult in a dynamic functional language such as Ruby or Lisp as you might be using function pointers that are unknown at compile time. [4]

Instead in our development we have introduced new constructs in our distributed language that allow the programmer and the computer to cleanly separate these types. The programmer must specify their intention at design time and this significantly simplifies the compiler's task of determining which loops will be distributed and which will not.

[1] The Do loop was first introduced in FORTRAN 66 in 1966. See:
[2] For details on other loop optimisation techniques take a look at
[3] See
[4] This same issue is referenced in Jon’s Map-Reduce and Parallelism article.