jodoro

It's been quiet...

2011-05-05T15:24:00.001+10:00

It's been quiet on the Jodoro blog for some time. Main reason is that we've combined the Jodoro tech with Culture Amp.

We've got the same focus combined with a lot of new and exciting ideas and energy. You might be interested in:

Process Amp -
Smart Checklists that will revolutionise the way you run & refine your business.
Cadence -
Performance management and continuous feedback for your people.

Keep in touch! jon

The Global Justice XML Data Model

2010-08-27T17:35:00.010+10:00

THE PROBLEM

The Global Justice XML Data Model - (GJXDM - Wikipedia Entry) is an XML interchange format used by law enforcement and other justice agencies in the United States.

It's a replete standard - it contains over 400 complex types and around 150 simple types with a total of around 2,000 associations (properties). Almost half of these focus on the Activity area (such as an Arrest) and on Personal Details.

Actual adoption of the model will vary depending on the context. This is due to expected variances in implementation, but also to the context of the application. GJXDM is implemented across a diverse range of institutions at different levels of government, each with different concerns and underlying objectives. The GJXDM has also been adopted in geographies outside the United States. Naturally these implementations have requirements that go beyond the available concepts, but also find large sections of the model inapplicable for their region.

The challenge is how to adhere to the entrenched standard whilst also accommodating the necessary variations on specific implementations, and keeping in sync with any updates to the standard over time.

HOW GRAFT HELPS

We believe Graft offers a lot to users of GJXDM. The first benefit is that it allows users to visually navigate GJDM as a domain model.

A key feature of Graft is the ability to extend other data models - we allow this extension in two modes: Active or Passive.

All of the elements of a passively extended model appear "greyed out" or "ghosted" in the modeling tool. You can then opt to selectively bring each of these elements into your model, even potentially over multiple releases of your implementation. To get an idea, you can extend the GJXDM Model yourself and experiment with drawing forward the parts of the model that are of relevance to you. Your modifications remain private until you explicitly choose to make your extended model public.

This allows you to cherry-pick the elements you need as they are implemented. Instead of handing out a schema with thousands of elements, you can produce a schema (by visiting the export tab of your model) that shows the elements and relationships actually being used, whilst still remaining consistent with the source schema.

An ancillary side-effect on this is performance. Large schemas can introduce a parsing bottleneck; reducing schemas to only include elements actually in use can make a big difference.

GRAFT KEEPS CUSTOMIZATIONS IN SYNC

Another key benefit of Graft is the control over extensions and modifications to the model. If you actively extend the source of GJXDM or passively extend the source of GJXDM, you are not left stranded on a standalone branch of the model. Your model is actually kept as a delta from the source model, allowing you to easily upgrade to future standard changes.

As an example of extensions and modifications being kept in sync with changes to the source GJXDM model, you might rename an element like LocationPostalCodeID - particularly if you're implementing the model in a very specific geography. This name change is stored as a delta, you can then update the underlying source GJXDM model and this change would still apply whilst still enabling you to get all of the updates to the underlying model.

Changes that consumers of GJXDM make can even be re-incorporated to the original model by the original model's administrators. A future update may transparently include the updates from derived models.

This is a very powerful outcome. In many modeling exercises tools encourage you to take a model, customize it, and effectively create an island. Graft encourages a different approach - Instead of grabbing a model and morphing it - Take an existing model, only use the pieces you need, focus on the changes you need for this release alone, and then keep taking the advantage over time.

YOU CAN LEVERAGE FROM MULTIPLE MODELS

Graft doesn't limit you to extending just one model. You can extend and integrate any number of models. It's unlikely that GJXDM is the only model you need to use. There may be other local and international standards, as well as in-house and bespoke representations for areas that are perhaps not covered or not covered appropriately for your needs in GJXDM. Graft lets you bring all of these models together, integrating and leveraging elements wherever required.

The Specify Tool in Graft lets you specify any number of Active and Passive Extensions, you can even modify these "on-the-fly" - for example, by replacing an existing Active Extension with a later release, or even with a different implementation.

We'll be working a lot more in and around Industry Standards such as the Global Justice XML Data Model. If you're working in this space, we'd love to hear your thoughts, comments and feedback.

Thanks,
jon@jodoro.com

Business Process Modeling - What's Your Purpose?

2010-08-22T16:27:00.004+10:00

When taking on a Business Process Modeling exercise, it's good to outline the purpose up-front. Some of the key perspectives that you want to consider are:

- Business Process Re-Engineering (BPR)
You want to examine the Business Process to make improvements - for example, increase automation, reduce duplication, streamline and parallelize.

- Execution
Intent is to take the Business Process and execute it on a technology platform, such as IBM WebSphere Process Server. Often this is part and parcel of increasing automation, or streamlining the technology.

- Instrumentation
Intent in this case is to design around measuring and monitoring a process. You might want certain customer orders or interactions resolved in a timeframe. Or you may want certain items to be escalated after a critical time period has elapsed. Or you might want to be recording data points that can be accumulated and mined for patterns after-the-fact.

- User Interaction
Aim here is to model the user interactions with a process. Central focus in these models is naturally the users, their team structures, skills and locations. There are a number of reasons for doing this; skills and role realignment. Building and refining escalation structures. Ensuring privacy and clearance compliance. Optimising team structures. Optimising and perhaps consolidating locations. Undertaking outsourcing or offshoring.

If you draw a business process from one of these perspectives, most of the time they will look completely different from the others. This can present a potential pitfall for process models. For anything of any complexity it's nigh on impossible to get a model that incorporates all of these concerns adequately. Conversely, if your problem is simple, then these approaches are probably overkill.

An example might help position this better -- If your aim is BPR, you might put each human task in sequence, even though these are effectively done by a single person all at once. The reason you put them in sequence is that you have data that tells you how long each individual piece takes, and you also know that they are usually done together. When you simulate, you get an accurate picture of how long the macro pieces take, and where critical paths exist in the process. You also get some measures of complexity - a major one being the number of unique paths through the process.
(As an anecdotal aside - I heard of a major utility that mapped their provisioning process - and the number of unique paths exceeded the total number of customers)

So, someone processing an order might check the customer's credit, validate their address, check the shipping costs, check stock levels, enter the order and then submit it to be fulfiled by the warehouse. However, it's unlikely that anyone will do those tasks in that exact order. There might be dozens of reasons of this, a common one simply being the order of papers in a pile.

This is a trivial case, but it could be significantly more complex with something like processing a mortgage application, or a business loan, where there can be dozens (or indeed hundreds) of fragments of information.

The intent in this kind of process modelling is usually to uncover overlaps and efficiency opportunities -- in processing a new customer order, you might be validating a customer's address numerous times. This could ideally be reduced to once; or twice if you have a Quality Assurance stage.

The problem is this doesn't necessarily represent the process in a way suitable for other objectives - such as execution, user experience or instrumentation. I've seen this happen before. The process gets defined and then forces a user to do a sequence of tasks in a strict order -- when the reality is the user is sitting with a pile of paper in front of them and probably wants to do them in whatever order is convenient. Worst case scenario is this macro task gets formalised as numerous minor tasks that must be checked in and out of work queues, or end up as a horrendous sequential "wizard style" User Interface.

Since the original intent of the exercise was to re-engineer the process to make it better, this is a counter-intuitive outcome. However, without going into that detail and making those assumptions, you couldn't have assembled and simulated the process.

In a similar vein, this process implies that you can instrument the "validate address step", whereas the reality is that this step may well be embedded in person shuffling through some paperwork. It's not possible to get data around this individual step; not in any practical terms anyway. Going even further, all of this might be completely irrelevant from an instrumentation perspective -- the key KPI might be customer satisfaction, which is likely measured in a completely different way.

This is not to say that Business Process Modeling doesn't have significant value. Part of the issue is the hubris that surrounds Business Process Management (BPM) software, which really pushes this as a "new paradigm". The idea is that you sketch out a Business Process and then the software is capable of (magically) executing the process. However, this is really impractical. Tooling can help significantly, but it's a means to an ends. Mapping and understanding your process is the intrinsic value; software enhances or amplifies that.

An approach led by Business Process Modeling can be a significant advantage to the deliver of software projects. It's an excellent means of driving out requirements and outcomes. Just be clear about the purpose up-front, and don't get fixated on auto-magic tooling. Even if you sketch your process on paper, and then code it from scratch, you'll be getting many of the core advantages. Add tools and technology on top to maximize the advantage, not define it.

Jon

New Graft Feature: XSD Exports

2010-07-20T20:45:00.011+10:00

Graft now has the ability to generate XSDs for any model. Visit the export tab of your chosen model.

THE DECISION

We have chosen to generate a ComplexType and an accompanying element of the same name for each class that doesn't represent an xsd primitive type. Each class association is represented within the ComplexType definition as either an element or an attribute depending on whether or not the association's type is one of the xsd primitive types. We optionally support a namespace, but at least at this stage further external namespaces will need to be added to the schema post generation.

THE DEBATE

XSD generation is a feature that has caused a fair amount of debate at Jodoro. This is because the XSD 1.1 standard provides many different ways to represent the same concepts.

For example, the Global Justice Data Model defines all of its entities and their associations in ComplexTypes and then defines an element of the same name to take each type. Each ComplexType tends to contain a ComplexContent which in turn may extend an appropriate ComplexType and/or define associations to elements of the defined ComplexType. This model provides maximum flexibility because a valid xml schema could contain any chosen subset of the defined elements. This is particularly useful when many different types of software systems need to consistently communicate with each other.

In contrast the CellML and FieldML schemas define very few top level elements (in fact one), and tend to fully define associations between ComplexTypes within the ComplexTypes themselves. This approach allows for a more formally structured approach to "valid" xml structures, which can be helpful in sharing information between very similar systems. Even these two very similarly structured xsds differ as FieldML does not use any namespacing and treats everything as a ComplexType while CellML utilizes the "cellml" namespace and defines both ComplexTypes and SimpleTypes.

The Schools Interoperability Framework (SIF) defines an element for each ComplexType, but demonstrates yet another structurally different way to build an xsd by defining very few named ComplexTypes and creating many nameless ComplexTypes in their definitions. This directs the consumer's focus to the named ComplexTypes, but comes at a cost of comprehension of the ComplexType definitions (and results in many nameless classes in the domain model). Like CellML, the SIF standard also defines SimpleTypes and attributes differentiating them from ComplexTypes and elements by whether or not they extend the xsd primitive types.

Beyond these structural differences we also needed to contemplate whether and how we support concepts such as enumerations. The issue for us is that this concept blurs the borders between meta-data and data. Technically each enumeration value is one of the valid instances of the enumeration type. It is tempting (and very common) to define enumerations within xsd schemas, particularly when the values are unlikely to change. However we would argue that even if the values won't change a better approach is to define the enumeration as a code of type string or integer, and to store and maintain the valid values outside of the schema. This provides a cleaner separation between structure and business rules.

We also discussed and debated many other commonly used xsd concepts such as "pattern", "maxlength", "union", "choice", "key", "any" and "all". At this stage we have chosen to leave all of these out as they are defining concepts that we are not currently explicitly representing within our domain modelling tool, and because we feel that, like enumerations, many of their uses are often business rules and arguably shouldn't be defined in the schema. If you have suggestions or issues with our current approach, please email myself or support@jodoro.com.

Doug - @douglasenglish .

Graft Rails Model Inheritance

2010-07-15T16:35:00.003+10:00

We've pushed out a new feature for Graft.

Up until now, each exported class has directly translated across to a Ruby on Rails Model, without regard for inheritance in the models.

Exported Rails Models will now include all the primitive of the Graft class, as well as any superclasses. So if you have BMW inherit from Car, and Car has a "license" of type "String" - both Car and BMW will have the license property. The same applies to relationships.

We've put up an example model that should help illustrate. You can export it to Rails using the instructions listed under the Export tab.

It's fairly new, so if you've got any queries or strike any issues, drop us a line via Twitter @jodoro or email support@jodoro.com.

Right now all classes are generated as Models, so you can consider this the superset. Later releases will allow for finer-grained control, such as flattening (e.g Single Table Inheritance).

Jon - @jonathannen.

Developing Business Process Models with Domain Models

2010-07-14T13:09:00.005+10:00

Quite often Domain Models are developed around or in concert with Business Process Models. This often leads to a the question "how do they inter-relate?" (and subsequently, "who's in charge? what drives the definitions?").

It's important to relate Domain and Business Process Models, but the approach for doing this shouldn't be too onerous. At the most basic level, each Business Process step should have input and outputs that are driven off the Domain Model. If you are processing a shopping cart payment "Process Shopping Cart", the input may be "Shopping Cart" and the output "Invoice", both concepts in your Domain Model.

Generally this is applied at lower levels of a Business Process, L2 or L3, but there is nothing to stop you working top-down or bottom-up. For this type of exercise, I believe a top-down approach is the best fit. Bottom-up has a tendency to get bogged down in the detail (the wrong kind).

The obvious exceptions are cases where the low level processes are well understood. This can occur when the Business Process is already operating or established in another form. These existing processes often have associated pre-existing MI, data or other statistics - and often this information can be quite fine-grained. In these cases, bottom-up isn't a bad place to start.

Whatever the level, it's advantageous to elaborate on the details of the inputs and outputs. What elements of the Shopping Cart are required? Which are optional? For each domain concept, the key questions are "how is it used or applied?", "if it's optional, what are the rules?" and "what's the context?". Aim here should really be to reduce the inputs and outputs to what is absolutely necessary - this is for a few reasons:

It best informs your Domain Model.

It best translates out to other representations, such as a technology implementation - Clarity here will have numerous downstream benefits, particularly in the implementation.

The act of reduction/distilling itself is a good instrument to drive the exercise.

It's easier to see what's missing. Blanket terms can hide conflicting points of view and assumptions.

It greatly assists the construction and execution of tests and user-acceptance.

Finally, in the heat of delivering a project, it can sometimes be hard to argue "why?" against business requirements. This is an ideal opportunity to ask.

That's just a quick summary for now. We'll dive into some of the advantages and pitfalls of models led by Business Process in a later article.

Jon

Workshop your Domain Models

2010-07-13T16:17:00.002+10:00

The value of a Domain Model is often overlooked. Done properly it defines the structural integrity of your software system in a language and an entity relationship structure that can be understood unilaterally by the business domain experts and the technical specialists. As such it significantly reduces upfront requirement errors, and better structures the solution to support future change requests.

But a domain model must be owned by both the domain and technical experts on the project for it to realize this value. I have utilized the following facilitation technique, borrowed from Feature Driven Development, to achieve this ownership on several of my projects. It was originally designed for Domain Modeling, but the techniques would equally add value to any form of workshop.

Identify a facilitator. Yes, cliche, but also amazingly too often overlooked. Choose the facilitator wisely. They should not get dragged into the technical debates, and shouldn't be afraid to call time. Keeping the meetings moving is critically important.

Identify a documenter. Encourage all of the participants to jot down points throughout the workshop - it's the documenters job to collate and distribute these at the end of each day. Focus on decisions and actions, backed up with justifications. Include lots of photographs - it is amazing how unifying their inclusion makes the process. The documentation should be released by close of business on each workshop day.

Keep the numbers balanced and low. It's of course important to ensure the appropriate stakeholders are represented in the workshop, but the key word is represented. Too many active participants in a workshop will grind its productivity to a halt. I've found that between 4 and 8 people tends to work well. Keep in mind that the workshop participants should also include appropriate representation of the domain experts and the key technical implementers. As close to 50:50 as you can orchestrate the better.

Choose your location carefully. Pick spacious, quiet and light rooms. Come prepared with appropriate equipment. Ideally choose a location that's outside the team's usual working location. You want to encourage full participation without distractions.

Define team norms as a team. (And make sure "have fun" is on the list!). Don't knock this one until you've tried it. Asking the team to define their own norms is an excellent way to ensure all team members adhere to the ground rules. Punctuality? Mobile phones? When is lunch?

Take lots of planned breaks, and finish early. Done properly such workshops are amazingly exhausting. On top of this, the team bonding experience of the breaks is perhaps almost as important as the workshops themselves. Make sure you finish the workshops early each day. I often finish them at lunch time. It's important to ensure the participants have time at the end of each day to take care of business as usual activities, and that the documenter has time to compile the days results. Morning are usually ideal for modelling - people seem to be fresher.

Reach consensus on issues. Yes in an ideal world the entire project team would immediately and unanimously agree to each decision, but this isn't an ideal world and humans are good at arguing. It's a good idea to give one person, perhaps the project's solution architect, the ultimate veto on decisions. Coming to consensus is a commitment from the team that, even if they don't 100% agree with the selected approach, that they'll at least live with it. Once a consensus is reached it's final. The last thing the project needs is for the same resolutions to be requestioned in a corridor three days later by a subset of the participants.

Foster diversity. A commonly used Feature Driven Development technique is to break the larger group into two or three subgroups (ensuring domain and technical representatives remain in each subgroup) and ask each to model the same problem space concurrently. After a set amount of time each subgroup resents and the group as a whole drives towards a common single solution. This may mean accepting one of the subgroup solutions entirely, or it may mean merging components of each.
Keep a "Parking Lot". I've seen the same technique referred to by many names. Essentially it's a team maintained list of topics that need to be closed out by the end of the workshops. It provides a mechanism to capture ideas as they're generated by the team without interrupting the currently flow of activities.

Pulse check progress. At the end of each day ask for feedback. What's working, what could work better? Is the room too hot? Were the brainstorming sections long enough? This feedback loop can massively improve the experience and end result.

Formalizing the domain model within a workshop process provides a very useful metric. For every week you spend in workshops defining the domain model you can expect a ramped up development team to spend three weeks building the content. So if it takes you four weeks to lock down a domain model for a project that needs to be finished the build phase within the next two months, you should probably take another look at the scope, or the time-line.

Doug

Draw a Real Picture

2010-07-08T10:12:00.004+10:00

In a previous article I mentioned five tips for Domain Modeling. In this post I'm going to drill down further on one of those:

1. Draw a "Real Picture".

I've used Real Pictures as a technique for a number of years; not only in Domain Modeling, but in Architecture, Design, Development and Planning. It's a pretty simple exercise, but yields a lot of interesting information. In this description I talk in the context of running a project, because that's the most common case -- but that is not universal.

A Real Picture is foremost a conversation piece. It lets the stakeholders involved discuss and explore the domain and it's context. Besides just capturing raw information, a Real Picture also serves as a gentle introduction for participants that might not be familiar with formal diagramming approaches. Often a good first step.

To start, get the participants together. You don't need to mass-invite everyone, but make sure all the key representatives and stakeholders are covered. So if it's a software project, you should have sponsors, end-users, designers, developers, testers and so forth. If you need to economise on people, focus on those that have most affected or impacted by the end result of the exercise, rather than on pure expertise.

There isn't much to developing the picture itself -- Simply draw the major concepts and how they inter-relate. In my experience, this is enough to start the conversation flowing; just draw and re-draw the diagram as the "oh, and then there is" comments evolve.

As a rule of thumb, I keep Real Pictures to one page - A3 or a White Board usually. If it gets really messy, redraw the picture. If necessary, drop in prompters to keep the dialogue flowing. Pick a concept and drill down - focus on quantities, qualities, time, costs, constraints:

- How many of these items are there? e.g. How many employees are at that location?
- Does it vary over the day, are there seasonal variations? e.g. Do you have a rush at Christmas, End of Financial Year?
- Will this change or move soon? e.g. Are you changing network providers? When did you last open a new outlet?
- What would invalidate this element or relationship? e.g. What security issues would make a location unviable?
- What do competitors or other equivalents do for this function or outcome? e.g. Does your competitor do this differently? better?
- Are there constraints on the way things should be? e.g. Does the government enforce strict reporting on certain items?

It's important that you don't constrain the actual process and format too much. The objective of this exercise is to get all the stakeholders together and to articulate a landscape that everyone understands. The end result should also be something that all stakeholders reasonably grasps and agree to.

At this stage, try not to abstract too much - early abstraction can lead to some necessary detail or important corner-case being lost. If a specific location, or server, or piece of software is mentioned, it may have special significance.

In one particular instance I found a number of stakeholders would refer repeatedly to a specific printer - i.e. the actual device itself and where it was located. Most of the project team didn't see any significance of this, myself included. The model and design assumed it was a standard office printer, the same any printer we used in our day to day. When we eventually dug deeper, it turns out this specific instance was special - it enabled secure printing. The way the device functions is such that a human operator never gets to see the contents. If you've ever received a bank PIN in an special envelope, then it's probably from a pretty similar device.

Not only was the fact that this device was special important - but also many of the project participants hasn't realised the significance of the information being send by this device. The communications that ended up at this printer required an extra level of security; it has a very special significance. Up until that point most of the team hasn't considered that special, but clearly it was. Similarly, the end-users thought that information was obvious. Suffice to say, staying concrete at this stage usually is the best default - Abstractions will flow later.

A Real Picture sounds trivial, in fact you might be over-the-top to call it a technique. Either way, it's a low impact exercise with a lot of benefits. At the end of the exercise you have:

A starter for the vocabulary that is in use. Perhaps more importantly, you've got a vocabulary that a variety participants are at least familiar with.

An outline of scope. With the Real Picture you should be able to draw a line around what's "in" and what's "out". If something's ambiguous, explore that some more. If scope has been defined as part of a project, this can be a useful validation point.

A fallback. If you hit a road-block or misunderstanding later, you might be down in some detail and a representation that some might not understand. If you have this, you can ask "but didn't we say these relate in that other diagram?". It's not the absolute source-of-truth, but it's a useful pivot.

An understanding of the various perspectives of the stakeholders. Usually the Real Picture will help illustrate the natural focus and bias of the stakeholders.

If the participants can't understand or arrive at a picture, then the composition or overall scope needs addressing. Every participant in the project or domain should at least be able to understand the A3 view of what you're trying to realise.

Jon

Customizing large industry standards

2010-07-07T14:32:00.009+10:00

THE PROBLEM

Throughout our consulting engagements with corporates attempting to implement industry standards, two predominant issues keep recurring:
1. Customization of the standard means forking from the standard. Taking updates from this point is at best extremely labor intensive, and at worst impossible.
2. Very few organizations want to implement an entire standard from day one. A more typical usage is a phased approach with core services incrementally released over several years. Absorbing the entire standard up front results in service consumers being expected to understand and work with verbose and confusing generated artefacts (such as XSDs) filled with far more data fields than have actually been implemented. In fact I'd hazard a guess that there are very few cases where more than half of those data fields will ever be implemented.

WHY?

The integration of data and its structure is a major problem for almost all organizations - the bigger the organization the bigger the problem. We have seen many organizations look towards domain industry standards to provide a more structured blueprint for internal and external integration.

Business-to-business standards tend to be fairly effective because each organization involved has a strong financial incentive to adhere to the standard. An example is the AS2805 standard for Australian EFTPOS transactions, which itself was based on the ISO standard 8583.

Internal integration is, however, another story. If you've ever worked in the IT arm of a large organization you'll know what I'm talking about. Almost all software development is project based, and it's an all too common story to hear that the 'refactoring' was descoped to 'phase 2' which - surprise, surprise - never actually ends up being funded.

However, for those lucky enough (or perhaps unlucky enough) to participate in the rare sort of project that does attempt to implement the 'phase 2', you've probably also discovered that adhering to an industry standard is anything but easy. For starters there are few tools available for dealing with standards. Most open standards are available in XSD format accompanied by MicrosoftWord, and MicrosoftExcel documentation, and most proprietary standards are accompanied by equally proprietary vendor lock-in product sales. Beyond this, no standard is ever quite what you need, either missing large areas important to your business, or overly bloated due to a missing-the-point attempt to please all consumers all at once, and more commonly a little of both. Almost all approaches require the consumer to start with the entire standard and then customize for their needs.

SO WHAT?

As practitioners in this space, and still recoiling from the war wounds, these two problems were front of mind in the inception of http://graft.jodoro.com.

Customization

Every model in Graft is internally stored as tiny deltas. When you extend an existing model in Graft you simply add further deltas to the same base structures. Version control and management is baked into the framework of the application. As such, even if you completely rename an existing class, you can easily adopt any future associations to the base class within your extended model. When and what you adopt is controlled automatically via "releases" configured in the 'specify' application.

Templating

In Graft we provide users two distinct ways to extend a model:

1. The most commonly understood approach we call 'active' extension. This quite literally means you start with the entire base model in your scope. Graft then allows users to explicitly descope parts of the model they don't wish to use. Such descoped classes will remain in the default view, but grayed out. This allows users to see what they have removed, and provides the ability to reverse the removal at a later point. Although users can still see grayed out descoped items, such items will not be included in any exports. An example of an actively extended model can be found here: http://graft.jodoro.com/models/31881

2. We are particularly excited to introduce a new approach to extending that we call 'passive' extension. This allows users to begin with everything descoped, and to draw forward only what is of utility right now. This is the approach we strongly recommend be used when working with large industry standards. An example of a passively extended model can be found here: http://graft.jodoro.com/models/31893

Often organizations wish to borrow from more than one standard. Perhaps, for example, the street address structure is particularly weak in their chosen strategic industry standard. Graft allows any number and combination of active and passive extensions within the one custom model. This is again configured in the 'specify' application.

Feel free to extend these and any other public models within Graft. By default extensions of any models will remain private. You can however choose to make your extended models public too by visiting the admin tab.

Doug

5 Domain Modeling Tips

2010-07-06T12:25:00.010+10:00

It's been a while since we've updated the blog. In the meantime, we've been working hard -- both consulting in the domain modeling space, as well as preparing the latest release of Graft. Given this, we're going to blog less about the technical implementation - and more about using Graft and the process of Domain Modeling.

To start we thought we'd discuss five key tips for building a domain model. We find these are generally applicable, but how and when you implement them will depend a lot on context. Key factors include the time at hand, pre-existing collateral, requirement maturity and the availability of Subject Matter Experts.

Here is a quick run down for now. We'll flesh these out over the coming weeks:

1. Draw a Real Picture
This has a lot of names - In this case it is just an informal sketch that every stakeholder will understand. Typically this is done on a white-board, but paper or online work just as well. Emphasis here isn't on building a refined model, it's about getting the landscape view, establishing an initial vocabulary and providing a basis of scope.

This picture doesn't have any formal semantics, it functions more as a conversation piece to illustrate and start grappling with the domain. Build this using whatever works for the participants. That said, if done right it can be an opportunity to gently start introducing some key modeling concepts and approaches.

2. Goal-Based Workshops
Many modeling exercises risk becoming esoteric. We encourage goal-based modeling that drives towards outcomes. Highlight the gaps, inconsistencies and pitfalls. Uncover the gnarly pieces early. You might not necessarily solve these immediately, but surfacing them is a key outcome.

Often the Goals themselves are simply driven out through workshops and consultation. In the ideal case you can derive Goals from sources such as strategic initiatives, requirements definitions, existing systems and business processes. If these aren't available or mature, focus on how the outcomes will be validated.

3. Deep-Dives, with a focus on Cohesion and Coupling
This is really where the heavy-lifting of modeling comes in. A deep-dive is a concentrated effort on an important aspect of the domain. Often the Goal-Based Workshops will help to drive out the topics.

The frequency and function of these will need to vary based upon the topic involved. Our ideal is a set of tight iterations on each topic. However, a common compromise is to run a deep-dive session, have the modeler refine the results offline and then a hold final session to reconfirm the outputs.

4. Revisit the Landscape
Deep-Dives generally uncover a lot of detail, however, before going to far down a rabbit-hole it's good to pause and check "is it really relevant?". The key question here is always "what's the context?".

Ideally you will have introduced Business Processes as a requirements driver earlier. However, if this hasn't occurred, this is a key opportunity to introduce them. The Business Process will let you ask a lot of key questions around model attributes and relationships - Are they used? When and what for? This can lead to surprising results, often for aspects the participants take for granted.

5. Iterate, Evolve & Refine
No model is ever "ultimately complete"; in fact it can be damaging to have this as a goal. It's important to plan for how a domain model can and should change over time. This can include small things - from notes and documentation on why decisions have been made and considerations for the future - through to a full roadmap on how the model needs to progress from here.

That's a very quick wrap for now. If you have any other tips, let us know and we'll see about exploring them over the coming weeks.

Jon

Twitter Updates

2009-07-22T18:44:00.006+10:00

We'll be publishing more of our release notes through the @jodoro Twitter feed. Particularly now that we'll have more regular releases as we run the Graft Alpha.

Follow Jodoro on Twitter here.

As always, feedback is welcome!

Graft application Alpha release

2009-06-26T14:12:00.005+10:00

Jodoro has today released an Alpha version of Graft (http://www.jodoro.com/graft), an online collaborative modelling tool. Graft is publicly available and free to use for those happy to accept that their models will be open and accessible to all other users. Jodoro encourages users to share, collaborate and leverage their models. Future releases of the Graft application will see the introduction of several export formats including xml schemas, and coding stubs for various languages, and a paid offering for users who wish to keep their models and model extensions private. Please feel free to provide feedback, suggestions and thoughts to feedback@jodoro.com.

Thanks, Doug and Jon

Jodoro in the Bay

2009-03-12T04:43:00.004+11:00

Doug and I are over visiting San Francisco and the Bay Area for a month or so. We've sublet a place in Inner Sunset, which we'll use as a base to explore San Francisco and the Bay.

Already been a very fruitful trip. In fact, it's initially overwhelming how much there is going on. If anybody has any tips on our trip, feel free to drop me a line.

Looping in Distributed Programming - Redux - Breaks

2009-01-27T23:40:00.006+11:00

In Doug's previous article on distributed loops he wrote about three types types of loop - Independent Loops, Accumulative Loops and Dependent Loops. His article goes deeper, but the basic determining factor is the dependency between each loop iteration. This factor is important to us as it directly translates into the level of parallelism we can exploit, and by implication the level of distribution.

As we've worked outwards we've also also started to find other important categorizations - generally under the (broad) banner of an Accumulative Loop. Our main new category centers on what would be called break in C-based languages - or it's often maligned cousins, continue and goto [1, 2].

Break introduces a control dependency between each iteration. In order to know if iteration [n+1] should execute, we first need to know if iteration [n] has decided to break the loop.

This is a bit of a pain, especially in cases where the loop is otherwise independent. The case we're regularly hitting is a simple search where you're only interested a single result.

For example, you might have a function that allows route-planning - maybe planning a series of flights, or bus trips. Rather than calculate some kind of optimum, in the first case you just want to throw together a quick example that the user then tweaks and edits. This might be useful to check if the route is possible at all, before diving into optimization. In this case you're only interested in the first match (if any).

In these cases I can theoretically evaluate a match against each individual entity in parallel. This issue is that after getting my first match, any subsequent iterations are superfluous.

So what are the options. I can:

Ignore the redundancy and iterate through the whole set - in parallel and distributed.

Build in a control dependency and break the loop when I find a match. This means each iteration must be sequential.

Allow a hybrid - Iterate in batches and allow breaks at the batch level.

Clearly the tradeoff here is the amount of redundant processing against the increase in parallelism. If you assume a simple, evenly distributed search space, on average option #1 will do twice the amount of work of option #2. If you greatly increase the size of the search space, this could end up being a lot of redundant work. However, as the search space increases the available parallelism increases also. Naturally it depends, but arguably you get a better "return" on the parallelism side as this has a number of other benefits - e.g. scale, response time and so forth. If you had infinite resources you would certainly adopt approach #1 and work as parallel as possible.

For option #3 you can make the best of the tradeoff if your batches line up well with your distribution model - for example, if you're CPU bound this might be that the batch size lines up with the number of cores available. You're able to exploit the highest level of parallelism that resources allow, whilst considerably reducing the amount of redundant processing.

It's semantically tricky to come up with a construct that makes this construct explicit in our new distribution language. Calling it "break" is the worst as it's overloading an existing term with different, but subtly similar connotations. What we're really after is a succinct way of introducing a "break-loop-at-your-next-convenience" statement. I've settled on "retire" for the moment, but it still doesn't seem to hit the nail on the head (suggestions welcome).

In terms of the technical implementation - Right now we're semantically catering for #3 by introducing control breaks, but under the hood we're implementing #1 by completely ignoring them. #1 is effectively the degenerate case of #3. Once our distribution technology improves we'll be able to exploit these cues better.

jon@jodoro.com

[1] Continue is in the same class generally, but doesn't apply in the specific case being discussed in this article - if you assume an otherwise independent loop, continue can be constructed using if/else semantics.
[2] e.g. http://www.google.com.au/search?q=break+continue+considered+harmful. Some developers put break in the same camp as continue and goto.

To Undo or not to Undo

2009-01-20T11:39:00.041+11:00

Here at Jodoro we’ve been working on the release of a Collaborative Modeling Tool designed to run out of a browser using Adobe Flex [1]. The basic premise of the application is to allow users to develop, customize, extend, or just make use of industry standards and shared models.

It's conceivable that the data sets (the models) being accessed by the client could get quite large. So we decided on an online, delta-driven communications approach. The Flex application maintains a local cache in memory for each session, retrieving data on demand when it cannot be found in the local cache.

As we started to design the user interface and discuss the sorts of features that users would expect, we naturally raised the topic of whether the user should be able to undo their work. Even though it seems to be often overlooked, we decided that Undo/Redo was important for the overall usability.

Our first thought was that there was a natural fit between the Undo/Redo actions and the delta-based communications. As such I was expecting the Undo/Redo features would be relatively straightforward to implement. However, it didn't take long for all the corner cases to emerge, and it turned out to be trickier than I’d first expected. Now that we have a fully functioning framework, I felt it worthy a blog article.

We started by discussing what our users would want and what did and didn’t work well in other applications that we use in our own everyday lives. It's somewhat common that a user can only Undo back to their last save point, or a preset number of steps, but we decided our application should preferably support undoing all the way back to the start of their current session.
One feature not often implemented is actually forking all the different Undo/Redo pathways as the user worked. However, we decided this was much too complex for the user to track, and settled on a linear progression.

We also muted on getting rid of Save entirely – instead synchronizing on every user action, or at some logical interval. Whilst this was technically possible, we decided that keeping the existing Save metaphor would be more comfortable for users. Finally we decided a cancel function to return the user to their last Save Point would also be useful (although not strictly necessary).

Framework

With my requirements in hand I set about designing the framework. I decided to capture a linked list, the Delta Chain, of all of the user actions that change the state of my model. To simplify the implementation I initialize the Delta Chain with a StartPlaceHolder Delta Object that, as the name suggests, always remains at the front of the chain. The framework keeps a pointer, currentDelta, to the most recently executed Delta Object, initially pointing to the StartPlaceHolder Delta Object. This pointer is represented in all of the diagrams below as a red triangle.

Figure 1: Basic action manipulation

Figure 1 demonstrates the frame-by-frame changes in the Delta Object chain as various user actions and undo and redo commands are enacted by the user. New Delta Objects are always added to the current end of the chain, and the currentDelta pointer is always updated to point to the newly added delta. As the user undoes their actions the pointer moves to the left, towards the StartPlaceHolder in the Delta Chain. As they redo the actions the pointer moves to the right. If a user has undone one or more items when a new action is added, the framework drops the undone Delta Obects and attaches the new Delta Object to the right of the currentDelta pointer, then updates the currentDelta to reference the newly added Delta Object. My StartPlaceHolder token can never be undone, and hence can also never be dropped.

I created an IModelDelta interface to be realized by all of my Delta Objects, and added the following methods:

        function doAction(m:Model):void;
        function undoAction(m:Model):void;
        function redoAction(m:Model):void;

My framework executes the doAction method when a Delta Object is first added to the Delta Chain (and never again). undoAction and redoAction are called for every Undo and Redo of that action respectively.

Save

Figure 2: Saving actions

I then moved to the SaveDelta (Figure 2), which in our case was an action to send model deltas in XML format to the server, based upon changes since the last save. The actual save itself is treated as a special case and can never be undone –the data has already been persisted to the server. If a user undoes back past a Save Point, the next user enacted save will calculate the necessary deltas (reversals) in order to make the model consistent.

The SaveDelta doAction generates the combined XML by making use of another IModelDelta method:

function generateXMLDeltas(m:Model):XMLList;

The SaveDelta doAction method traverses backwards along the chain of Delta Objects and calls generateXMLDeltas for each Delta Object until it reaches the last stable save point. One of my early realizations was that the last stable save point may not necessarily be a SaveDelta. It may obviously be the StartPlaceHolder, but it could also be a CancelDelta, which I discuss later in the article.

As in Figure 1, if the user performs an Undo, and then another task, the undone Delta Object is discarded. This approach works fine for any deltas that have not yet been saved to the server, but is more difficult when the user has undone past a Save Point – these cannot simply be discarded as we need to generate reversals at the next Save. There are a number of options to handle this:

The framework could generate the reversals upon discarding the objects, and update the server immediately. This simplifies the Delta Chain, but at the expense of increased server communications and inconsistency with the Save/Cancel metaphor.

The reversal XML could be generated and stored in memory until the next SaveDelta doAction method is executed. This again simplifies the Delta Chain, but means the Undo/Redo framework needs to also understand the reversal data. The XML generation also becomes disjointed.

The saved delta objects could be stored in a temporary list until the next SaveDelta doAction is executed. This creates an overhead of moving objects around arrays in memory, and is more complicated to build, but is actually quite a clean approach. It also takes the actions out of order which is fine unless you intend to implement a cancel function.

The objects could be left in place and marked to indicate to the framework that they are to be ignored except by the next SaveDelta doAction which will generate its reversal XMLs and drop it from the chain.

Figure 3: Undoing actions already saved

My current implementation is option 4 (Figure 3). I introduced a new flag called isDead and added four new methods to our IModelDelta interface:

        function isDone():Boolean;
        function setDoneFlag(d:Boolean):void;
        function isDead():Boolean;
        function setDeadFlag(d:Boolean):void;

The framework now determines whether a Delta Object should be discarded, or just marked dead based on whether the Delta Object had already been saved to server. Such Delta Objects are easy to detect as they have been undone and lie to the left of a SaveDelta in the chain. The framework also now ignores dead objects, skipping them in all Undo and Redo actions (along with all SaveDeltas). To everything except the next executed SaveDelta doAction method a dead Delta Object is dead! Finally I revisited my generateXMLDelta methods for each Delta Object and upgraded them to generate reversal XML deltas when the object is in a dead state.

It is also worth noting that there is a difference between an item that has been undone and lies before the previous SaveDelta, and a delta that has been marked dead. The first scenario happens when users undo actions that have already been saved, but can still be redone by the user. The second scenario represents items that would have been discarded if not for the fact that they have already been saved to the server – The user can no longer redo these actions, but we still need to generate a reversal.

My SaveDelta doAction currently iterates backwards through the Delta Object chain looking for:

Actioned Delta Objects that haven’t yet been saved. When traversing backwards from the current SaveDelta, these deltas always appear before a previous CancelDelta or SaveDelta (or StartPlaceHolder).

Undone Delta Objects that have been marked dead. Again traversing backwards from the current SaveDelta, these always appear after a previous SaveDelta. Technically the implementation can take advantage of the linear nature of the Delta Chain and only keep looking for dead Delta Objects while the Delta Objects remain dead. My implementation of cancel (details below) complicates this and I’m currently traversing the entire Delta Chain. At some point this will become a performance bottleneck and I will need to revisit to determine more accurately where to stop traversing.

Cancel

The CancelDelta needs to bring the system back to the last save point. This could be a SaveDelta, or the StartPlaceHolder, or it could be another CancelDelta. This is because the previous CancelDeltas themselves were resulting in returning the system to the last save point. Unlike Save, I chose to implement CancelDelta as an action that can be undone and redone. This adds complexity, but we felt also added a justifiable amount of usability.

There are two types of actions that need to be handled for a cancel:

Any actions that have been executed since the last save need to be undone.

Any actions that have been marked dead since the last save need to be redone and marked alive. (This scenario only occurs if the user has undone items that were executed before the last save.)

Figure 4: Cancelling unsaved actions

When all of the actions in the scope of the CancelDelta have occurred since the last stable save point (Figure 4), the doAction of the CancelDelta simply needs to undo these Delta Objects. The CancelDelta undoAction and redoAction should then respectively redo and undo these very same Delta Objects.

Figure 5: Cancelling undoes of saved actions

However, if Delta Objects before the SaveDelta have been marked dead (Figure 5), the CancelDelta doAction also needs to resurrect them and call their redoAction methods. The CancelDelta undoAction also needs to call each Delta Object’s undoAction, and reinstate each as dead. The CancelDelta redoAction finally needs to reverse the steps again.

Figure 6: A complex cancel scenario

There will also be scenarios where the same CancelDelta needs to manage both the undoing of actioned but unsaved Delta Objects and the resurrecting of dead Delta Objects (Figure 6). Luckily because new Delta Objects are always added to the end of the Delta Chain, and any undone items not saved are dropped, the Undone Dead Delta Objects that need to be Redone will always appear in the Delta Chain immediately before the actioned Delta Objects that need to be undone (ignoring SaveDeltas). As such, in my implementation of the CancelDelta doAction method I keep two counters, (1) the number of unsaved deltas I reverse, and (2) the number of dead deltas I resurrect. My CancelDelta undoAction and redoAction methods use these two counters to ensure they unwind and rewind the Deltas in the correct ways and order.

Summary

With all of this in place I was finally ready to start building the real user actions! But you’ll be pleased to know that this became the easy part. I just needed to ensure that each new Delta Object’s doAction, undoAction and redoAction methods left the local cached model in the correct respective states, and that the generateXMLDeltas method implemented the correct XML deltas and reversals based on the current state of that Delta Object. The framework I’d built handled the rest for me. Before long our whole application had seamless Undo/Redo functionality and I was very pleased I’d put the effort in up front to get it right.

doug@jodoro.com

[1] We chose to build the front end in Adobe Flex primarily for its rich set of user interface features, and for the ease of distribution to our potential customer base through the Flash plug-in.

Corporate OpenID

2008-11-25T16:51:00.006+11:00

We'd like to use OpenID in the solution we're developing.

There are currently a large number of OpenID providers, including AOL, the BBC, Google and Yahoo! ... and a lot of individuals will already have accounts with one of these providers. However, we'd also like to offer the same experience to corporate users.

The solution to this seems pretty obvious - corporates should become OpenID providers.

It should also be possible to do this in quite a secure fashion. When an individual uses Open ID to a login site, a few key things happen:

The site contacts your OpenID provider and works out (a) the login location and (b) establishes a secret key for this session.

The user is then redirected to the login location to enter their username and password. This is page of the provider (e.g. GMail), not the site you're actually trying to access. As such, your password is kept between you and the provider alone.

If you're successful, you're send back to the site. The key is used to verify that you've signed in with the provider.

There is clearly a bit more to it than that - you can opt in and out of various things for example - but that's the basic dialog.

There is nothing stopping corporates becoming OpenID providers for themselves. To achieve this, they would put a system in their DMZ to interact with relying parties (i.e. the sites using the OpenID).

The "sign-in" page could be established at a location internal to their network (this is not necessary and is perhaps limiting, but it would increase security). As such, when you hit an external website, you'd be redirected to an internal site to actually log in. The employee's username and password never actually leave the internal network, encrypted or otherwise. This login process would also be (more) difficult to spoof or phish, and remain quite resistant to a lot of DNS attacks.

Even better, such a solution could be single signon (SSO) - using the employee's login session at their workstation, rather than requiring them to enter their username and password again.

Sun already do something similar to all of this- they provide OpenIDs to their employees via OpenID at Work. Although from what I can tell, this is a separate identity, rather than being linked to internal corporate ones.

Microsoft is a big supporter of OpenID - For example, Windows Live will support Open ID. However, I can't find any specific literature regarding a turn-key "Turn your Active Directory into an OpenID provider". As many corporates rely on Active Directory, this kind of solution would be a rapid enabler.

If anyone knows of solutions or initiatives looking into this, drop me a line.

...

jon@jodoro.com

Burn your CPU Cycles

2008-09-19T21:39:00.013+10:00

Doug (pictured) and I are working on a few new technical approaches at the moment, aiming to meet some stringent functional targets we've set for ourselves [1]. This has led to some curious insights, especially around how our approach is affected by some ingrained prejudices.

Whilst I'm hardly near being tagged with elder-statesmen euphemism, I must admit that I've got an antiquated aversion to certain types of waste - wasted CPU cycles and wasted memory [2]. This is perhaps a side-effect of coming into technology via embedded programming. Either way, even after years of working in and around resource-hungry Java environments, I've never shaken the feeling that it's somehow lazy or wrong.

Paul Graham makes reference to good and bad waste in his essay, The Hundred-Year Language. He argues about the beneficial tradeoff of giving up some performance for elegance. We certainly subscribe to this. However, it's sometimes hard to narrow down the good from the bad [3]. For me, a good example is batch versus online (realtime) processes. If you've ever worked in technology at a large organisation, you've probably seen your share of batch processing.

Now, there are good reasons for batch processes. However, in my experience, it's more often a form of institutional laziness. Online integration is complex, harder to test and usually much more expensive to develop (upfront cost). A simple batch system is really simple. On the other hand, batch systems can become operational nightmares (ongoing cost). Compounding this, once you get beyond a certain threshold with batch processes the complexity goes through the roof [4, 5]. You can end up processing a lot of data that hasn't changed and simply didn't need processing. Even so, organisations still plough on with batch systems.

However, there is another key to all of this that is sometimes overlooked - your batch process is probably leading to a poor end user experience.

If you've ever requested something and been told "Sorry, it won't be processed until the next business day", then there is probably a batch system somewhere to blame. At one organisation I was at, you could change you address online (great!), but it took three days for this to actually be trickle down to all the necessary downstream systems. You can probably guess that this regularly led to some unhappy customers.

When Doug and I are working on these targets, we have come up with few ideas that will be processing intensive. The instant instinct was always to think "no way, that'll use a whole lot of resources". Usually followed by "... what a waste.". However, is it really a waste? I've come to the conclusion that this is not the right way to look at the problem.

This type of mentality locks you into giving users what they have today. You're defining parameters around the user's experience that align with some (subconscious) mental model of a level of resource consumption. The fact is that the next generation of successful applications will be the ones giving users more... And to give them more, you're going to have to crank up the output.

In my view, a lot of developers are overly fixated with performance when they should have a focus on scale. Performance is important, but scale is just fundamental. If you can give your users a better experience, and your application can scale out - then it's easy to see that you're probably on a winner [6].

If you want to add that Web 2.0 tagging system to your widget, then you're going to have to user more resources. Working on a better search? Well, I think it would be a certain bet that it's going to be using a lot more resource. I recall when I first saw YouTube my immediate reaction was "they're crazy - this is going to crash and burn - they're just going to waste a whole lot of money on bandwidth". My fixation with the bandwidth consumption blinded me to the fact that it was giving people something that they really wanted [7].

So we have a new rule here - the computer works for the user. If we hit an example where we're using more resources, but improving the end user experience as a result - we take that as a sign that we're on the right track. We even push it where we can - We look for opportunities to use more resources, asking ourselves if it's improving the outcome for the user.

... Then the challenge for us is to make it scale.

jon@jodoro.com

[1] Partially to give ourselves a stretch target to broaden our thinking, but primarily to attain a target user experience.
[2] This reminds me of my Grandmother and her Post-WWII frugality. She couldn't tolerate waste of any kind. Much to her annoyance, I could never like Bubble and Squeak.
[3] If you've ever worked at a project-driven organisation you'll be aware of the bias to run a project cheaply, usually ignoring the higher long term cost.
[4] When I was at a large UK bank the batch used to run overnight - It was a pretty busy batch window - after everything ran overnight they only had 14 minutes to spare, so god help them if something went wrong. These same batch windows were also an impediment to doing simple things - like full 24/7 online banking, or opening branches on a Saturday.
[5] I also recall some developers who used to maintain a large batch operation. They started out using Microsoft Project to control all the windows and the dependencies (using the leveling feature). Eventually it got so complex they had to write a bespoke tool to maintain the schedules - let alone the complexity of actually monitoring and maintaining it.
[6] Naturally, there is a cost-benefit ratio coming in there somewhere. If you've ever needed to fix an application that doesn't scale, as opposed to one that is just plain resource hungry then you'll know there is a world of difference.
[7] I had a similarly incorrect first reaction to Flickr.

Patents explained for Start Ups

2008-09-11T00:28:00.010+10:00

A couple of weeks back Jon and I visited an Intellectual Property Lawyer to discuss the steps required to take out some patents on our technology concepts. Apart from the token IP Law education they squeezed into my degree all those many years back, I have never ventured into this side of business before. So I found this process pretty enlightening. [1]

If you are seriously considering patenting, do find a patent attorney and have a chat (yes, specifically patent attorney as opposed to intellectual property lawyer). I say this both as a disclaimer (IANAL), but much more than that, it's simply a good idea. Patent and IP Lawyers are generally experienced, clever, well-connected people and can offer good advice and referrals on a lot of topics. [2]

The first consultation is usually free (in fact let the alarm bells ring if they try to charge you). Payment generally only starts when you decide to go ahead with the process. [3]

The first thing our lawyer asked was if we were ethically opposed to patents, especially software patents. Irrespective of the answer, the advice was still to consider applying – if for no other reason than to provide protection against aggressive competitors or "patent trolls"[4]. One could, I guess, argue that this is what any lawyer is going to tell you. Even so, anyone in the industry would probably be aware that it still makes a lot of sense. He then proceeded to walk us through the different stages of registering a patent, and the different options we had – including when and roughly how much money we would need to fork over.

The basic premise of a patent is that you are disclosing your invention to the state in exchange for exclusive rights. [5] To do this you need to demonstrate that you have innovated in your chosen field beyond what would be considered a normal logical step.

Simply having your patent submission accepted doesn’t necessarily mean that you have exclusive rights to your invention. It can still be rejected by the courts. This is due to two reasons:

Someone else may have lodged the same or similar invention before you. The patent approvers do not trawl through all of the existing patents to determine if you are the first, this is your responsibility – and that of your competitors. Google have recently implemented Google Patents, a searchable data store of all United States patents.

Ultimately the courts may find the patent unenforceable. Patents are actually written as a series of claims. The first couple are typically very broad sweeping claims that courts are unlikely to uphold; the claims then start to narrow in on the truly innovative parts of your patent becoming more and more specific. This is done in the hope that the higher claims will be upheld, whilst still providing fallback protection if they are not.

Whilst the term ‘international patent’ occasionally seems to pop up, in reality it is not possible to establish a ‘world wide’ patent. Patents are granted on a per territory basis. Not only that, the rules as to what can and can’t be patented, how the patent itself needs to be presented, and what exclusive rights you can derive from the patent differs from territory to territory.

Much of the fundamental basis of the patent system is similar in many territories. Clearly, the broad aims of patent law in the United States and European Union are very similar – However, even when the basis of the patent law is very similar, there can often be a lot of variance in the interpretation and bias of a particular system.

In the United States, for example, patent applicants can typically secure a patent by demonstrating that their application is inventive. By contrast in the European Union, there is a stronger focus on demonstrating the patent solves a real world problem. Australia has traditionally followed the US system, but is now tending to adopt the approaches of the EU. There are other differences too. In the EU the significant date in resolving conflicts is the date of submission, while in the US it is the date of invention.

Ultimately, if you need International Patent protection, and many Internet-centric technologies do, you have no choice but to individually apply for patents in each country that you wish to secure exclusive rights to your invention. Each country has a differing cost associated to them (It varies considerably, but circa several thousand dollars US.[6] )

Much of the expense is to have a patent attorney draw up the draft patent to specifically meet that country’s requirements, and to pay for an official to prosecute the patent. However, you might incur other costs too, such as requiring official translations of your documents.

There are some options though on how you go about this, and those options have differing time periods, and differing cost structures.

You can simply apply for individual patents in the specific countries you choose. This is the cheaper option if you only wish to secure patents in a few countries. Following this path you would typically look at with a 12 month period before the patent is prosecuted and your payments are due.

There is also an option to follow a path that uses the process defined in the Patent Cooperation Treaty. [7]
This isn't an ‘international patent’, but a normalised processed for applying for patents in multiple countries. Amongst other things, it allows you to set up on priority date (date of filing) across all the participating countries. [8]

This PCT process is initially more expensive (circa US$5,000 - 10,000). However, the normalised process pushes out the need for individual country patent prosecution decisions and payments by a further 18 months.

So whilst there may be a larger up-front cost, you actually buy yourself some time. If you're a start-up, this can be invaluable. It allows you to obtain a level of IP protection across all PCT countries without going through the expensive process of prosecuting these patents. This time could be invaluable for securing Venture Capital.

However, there are also other benefits. The international patent reviewers will highlight all of the common problems patents hit for the differing countries. This means when the patents are sent to the individual countries to be prosecuted it is far less likely you will see rejections, hence potentially saving you money on resubmissions. The international patent reviewers can also advice that you are unlikely to be able to secure patents in certain territories, saving you the cost of these submissions. The other advantage is that it provides you another point in the process in which you can make modifications to your patent, which may help if the specific real world problem you are solving has slightly changed since your original submission.

doug@jodoro.com

[1] Jon's got a little more experience, he filed for a patent back in 2000.
[2] For example, your company formation might be pretty key – especially in an IP context.
[3] Plus especially if they find your idea promising they’ll be jumping at the chance to provide a promising new IT start up some early favours.
[4] See http://www.paulgraham.com/softwarepatents.html
[5] For more details on the definition of a patent: http://en.wikipedia.org/wiki/Patent
[6] Japan is apparently one of them most expensive states to secure a patent in primarily due to the increased costs of compulsory official translation.
[7] http://en.wikipedia.org/wiki/Patent_Cooperation_Treaty
[8] The vast majority of large economies are part of the PCT ( http://www.bpmlegal.com/pctco.html). There is a list of non-members here ( http://www.bpmlegal.com/notpct.html).

Genealogy the Ontological Problem

2008-09-07T14:07:00.005+10:00

Genealogy is a contrast. It is a very personal experience and endeavour. The majority of people are interested in their own family, their own ties and their own history. Ultimately, however, this history is shared. As individuals investigate further back into their ancestry, the broader the common ground becomes.

We are also interested in Genealogy - because it's an both a fancinating space, and it's one that exercises a lot of interesting problems.

There are a number of sites dedicated to Genealogy for some time, but the majority of these are forums in which people collaborate (e.g. "Anyone know of a John Browne in Manchester, England, sometime around 1855"). There are also emerging sites that let you build family trees, but these are generally private trees, or limited to collaboration in family groups. TechCrunch reported on that a "war" is developing in the genealogy space [1].

Speaking very generally, a family tree is a nodal representation of relationships between individuals. The emphasis, naturally, is on genetic ties. The key relationships are Marriage (or more correctly, parents) and parentage. These ties link the individuals in the family tree.

This is relatively simple so far. However, there are more complex relationships, even with this base simple set of relationships. For example, you can infer "great" relationships (ancestry). As you add each "great", this increases exponentially. There are sibling relationships and other more specialised scenarios - such as half-siblings, step-sibilings, twins, or adoption. In modern times you now have the possibility of same-sex marriage, surrogate pregnancy or sperm and egg donation. There are also other cases, which could sometimes be skeletons in a family tree - Multiple marriages, incest, adultery. You wouldn't need to go far back in many family histories to find someone who vanished (and perhaps had another family unknown to both sides), was disowned or simply just forgotten.

These can all be accommodated in most traditional data models. However, the real complexity is that family trees are still personal and can disagree. This may be as simple as a disagreement over some base factual information such as a name (e.g. Doug vs Douglas, or Smith vs Smithe). It is considerably more complex when there are more structural differences, such as disagreements over a parentage, or and entire lineage.

This is hard to handle using traditional data models. A lot of approaches take the tack of a "conflict resolution" mode - much like source control. However, this is inadequate. The fact is, a lot of these conflicts will never be resolved. Someone's Aunt may never agree with such-and-such's Uncle. You can simply replicate all the information in each family tree, but you're creating a lot of redundant data and (severely) limiting the utility of this information. This approach simply devalues the power of the information when people do agree.

To combine this information using a single repository requires a functional and data model that is exceptionally flexible. It's somewhat clear that it's approaching what is often called the "ontology problem" [2]. Ontologies and the Taxonomies are key to many (all) information domains, and absolutely fundamental to modern information systems.

If you are managing any kind of knowledge, getting the ontology right is pretty important. If you've ever tried to classify or put something into a hierarchy, then it's likely you've hit this complication. Object-Orientated development certainly falls into this space. For example, I have a Van here, and it's classified as Vehicle, but what happens when I have a Motorhome? Or an Amphibious Vehicle? Or an Amphibious Motorhome? If I'm working in a bookstore, do I put a book under fantasy, crime or literature? It might fit in all three.

In these cases, there is no correct answer. You end up with multiple classifications, all of which are correct. Just like genealogy it depends on the context. The problem with ontologies is that they can be extremely difficult to define, and like Family Trees, they are complex, recursive, inter-dependent beasts.

When you look at the substance of the Semantic Web, ontologies and taxonomies are absolutely key. You can't semantically link things together unless the ends agree on this key ontological information [3]. It would be impossible to search for "Motorhomes" on the Semantic Web if different sources classify this information in a completely different way. This classification might work in some contexts, but not others. You might end up with islands of information that aren't compatible and cannot be interconnected - the exact opposite of what the Semantic Web is trying to achieve.

This is why we see genealogy as a generalisable problem. Crack some problems in the genealogy space and you might be solving some fundamentals for the Semantic Web - and vice versa.

jon@jodoro.com and doug@jodoro.com

[1] See http://www.techcrunch.com/2008/09/03/genis-quest-toward-one-world-family-tree/ and http://www.techcrunch.com/2008/09/06/family-tree-wars-continue-myheritage-raises-big-round-shows-impressive-growth/.
[2] It might be worthwhile looking at the Google search for this term, http://www.google.com/search?q=ontology+problem.
[3] See http://novaspivack.typepad.com/nova_spivacks_weblog/2004/11/the_ontology_pr.html

Working towards a Vertical Language

2008-09-04T04:25:00.006+10:00

As part of our concept work here at Jodoro, we are working on a computing language, which we internally call Teme.

Anecdotally, there appear to be a lot of languages. The memory is probably a bit sketchy by now, but I recall from my University lecture days, there were circa 100 viable High Level Languages in the late 1970s. Today this would is in 1,000s, depending on how you'd classify a viable language. Anecdotally, in my own work I'm increasingly aware (and somewhat overwhelmed) of the number of language and technology options that are available [1].

Personally, I think the increase in computing languages in the recent era boils down to a few factors:

(Ever) increasing availability of processing power makes dynamic languages more accessible, and is increasing the (broader) interest and activity in this domain.

Environments such as Eclipse allow make it very easy to create basic tooling for a new language. Adequate tooling is a common impediment for a language gaining adoption. I could also point at other key examples, such as Apache Ant or the LLVM project.

I'm sure some pundits will disagree, but I believe dominance of languages such as Java, C# in corporate environments has allowed niche areas to develop.

The Open Source Movement has greatly enriched the amount of libraries and useful code that is available. This can give languages the "shoulder of giants" kickstart they need. There are innumerable examples, but LLVM is another great example for this.

We do more computing nowadays - and the computing we do is more novel.

... I think that some observers can underestimate how significant the last point is. As the computing world evolves, languages are changing in their focus and capability. A computing language is, first and foremost, a means of getting a human to tell a computer what to do. Most significantly, what we are asking computers to do is broadening day by day.

However, all of this begs the question, what is a language exactly? [2] Any significant software solution will eventually result in what I'd describe as some form or subset of a language - A large SAP or Siebel installation may nominally be running on a Java platform, but you'll need to understand the semantics of SAP or Siebel to actually develop anything. It's possible to argue that many developers will be developing Domain Specific Languages in their solutions without really realising it [3].

Equally, you may look at Microsoft .NET as a set of languages - However, at the same time, these languages all share the same underlying class library. Arguably, the "language" of the class library is a more significant than the specifics of C#, VB, or another .NET language.

This (perceived) mayhem of different languages is compounded by a degree of flux with a number of emerging technologies. For example, a lot of institutions are investing in Business Rule Engines [4]. There are a variety of these engines available, each with a different area of specialisation and their own inherent language and tooling. There also also other technologies that are emerging rapidly in corporate environments - Business Process Execution is another classic example.

With that in mind, you could consider the Google search box a language. I often use "1 USD in AUD" in Google, or "Time in London" (there are dozens more, such as calculations or stock prices). It's a way from a formal grammar, but it's a new language construct that we're all using every day. The nice thing about Google is they obviously have mined and researched these kinds of queries and catered for these common examples. It's a language that is evolving in response to the actions of users.

So why are we developing another one? To accommodate the novel things we want our system to do (novel in our minds in the very least). As Paul Graham points out as part of this Arc Language FAQ - It would be surprising if we didn't still need to design more languages. [5], [6]. If you need to work in domain that is any way novel, there is a good chance you need to start thinking about the language constructs required.

There are a number of specific reasons why someone might start developing a language in earnest. For example, there is a lot of buzz around the Semantic Web at the moment. A lot of the effort in this area has focused on the SPARQL Query Language. The development of this language, and other standards, are absolutely fundamental. For us, there were a few drivers in taking starting to look at our own language:

As an intellectual exercise.

To explore a model of computing that is highly parallel and distributed.

To address a particular problem domain that is unapproachable or clumsy in other languages.

In particular, we are interested in increasing the parallelism and distribution of our systems. This is key in constructing a system that can both (1) process large data sets in "user time" as well as (2) be capable of scaling to meet any increase in user demand. My previous article on Map-Reduce discusses one language construct for parallelism - we're keen to take this further and into other processing domains.

Developing in a language is a useful exercise. Developing a language lets you put a lens your work and view it in a different way. If you take your problem and look as a language problem, you see the semantics in a new light. Even if you're working in "another language" such as Java or Ruby - it's still useful to think about the constructs your are building as a language and work towards that. This is a key philosophy to languages such as Lisp [7] , but it's a still a fundamentally useful exercise regardless of your target language.

How to go about this is the next question question - It's very easy to set up the key axioms for a new language or approach, but the difficulty arises as you inevitably encounter trade-offs along the way. In our effort we're continually walking a line between having something ultimately flexible, against having something that remains interoperable.

In later articles I'll write about how we decided to go about this (for better or worse) and some of the motivations and experiences that drove our efforts. I'll also discuss what we mean by a "Vertical Language" and how that is shaping our end goals.

jon@jodoro.com

[1] Eric Lebherz, HyperNews Computer Language List, http://www.hypernews.org/HyperNews/get/computing/lang-list.html, 2005
[2] Trying to define what a language is semantic is perhaps the ultimate tautology, but hopefully you get my point. Wikipedia has an article on Computer Languages that is less philosophical.
[3] This is perhaps expressed more succinctly by Greenspun's Tenth Rule
[4] I've personally come across three Business Rule Engines in the same organisation - ILOG JRules, Fair Isaac Blaze and Experian. All very different rules engines, with their own language definitions.
[5] Paul Graham, Arc Language FAQ, http://www.paulgraham.com/arcfaq.html, Year Unspecified
[6] Paul Graham also has a great essay called the The Hundred-Year Language that is well worth reading.
[7] Lisp pundits call Lisp a "Meta-Language" for this very reason.

Looping in distributed programming

2008-09-03T21:06:00.017+10:00

for (init_expression; loop_condition; loop_expression)
{
        program statement;
}

while (loop_condition)
{
        program statement;
        loop_expression;
}

The simple ‘for’ and ‘while’ loops above have so permeated so many programming languages that it's hard to imagine a programmer that hasn't written them innumerable times. So much so that most programmers would not bother to question its function, it is after all intuitive and self explanatory.[1]

However, as we started to look at parallel distributed execution, we realised the dynamics of these constructs are pivotal. This essay will dare to delve deeper into the simple loop construct to question how to make it perform. In particular I will focus on loop parallelisation.[2]

Historically the vast majority of loop constructs have been written to execute in a single processor, and that is exactly how the creators of the original construct intended us to use it. However, over the last decade we have seen the emergence of some large data sets, not the least of these being the Internet. This provides many classes of problems that require us to loop over such large data sets that a single processor, no matter how fast, is simply not going to return a result within tolerable real-time constraints.
Simultaneously, processor vendors are moving to an increasingly “multicore” strategy. Intel recently told developers to prepared for “thousands of cores”.[3]

There has historically been a happy marriage between the way humans think about problems and the way computers execute them – both, even if multitasking, essentially boil down to linear execution. As such the programmer writing an algorithm typically thinks and writes one step after another. But when it comes to distributed computing, suddenly we are asking computers to work in ways many people cannot – at least not alone. Programming for distributed computing expects the single programmer to derive and (somehow) write an algorithm as if an undefined number of machines would be collaboratively executing it.

With this in mind, it becomes abundantly clear that our timeless loop construct is inadequate for distributed computing. We either need a construct that informs the computer network what can be distributed, or we need an interpreter that can determine at run time what can and can't be distributed. Of course simply because something can be distributed doesn't necessarily mean it should – for example if the data set is small the distribution could be all overhead. However, there is no doubt there are many problems for which this isn’t the case.

Let us first look at the different types of loops to help us understand which can and can't be distributed.

Independent Loops

The first I like to call the Independent Loop. Such loops will execute the same logic/function over an array of data. It might be for example 'scoring' the relevance of a set of data that has been collected. The key to this type of looping is that the only data being modified is the data being looped (or the corresponding element in and identically sized 'result' dataset). An example of this type of looping is:

int data[] = {5, 4, 1, 6, 3};
for (int i = 0; i++; i < data.length)
{
data[i] *= data[i];
}

The advantage of Independent Looping is that it can be maximally distributed. We could comfortably split 'the data' into any number of arbitrary parts and execute the same multiplication 'for' loop across each part. Concatenating the resulting data sets would leave us with an identical result as to if we had executed the data set in the single algorithm.

This type of looping is essentially a form or superset of Map-Reduce as discussed by Jon in his article on "Map-Reduce and Parrallelism" posted in July, 2008.

Accumulative Loops

The next loop type I call the Accumulative Loop. An Accumulative Loop is similar to the Independent Loop in that the function is executed on a single dataset one element at a time. However, rather than (or as well as) modifying the dataset element, the algorithm is modifying at least one variable outside the loop. What is key in this type of loop is that the sequence of modification to the external element isn’t important. This is an important distinction as if the outcome is dependent on the order of execution, then the algorithm is not deterministic in a distributed environment.
An example of a use of this kind of loop is to sum the values of a dataset:

int data[] = {5, 4, 1, 6, 3};
long sum = 0;
for (int i = 0; i++; i < data.length)
{
sum += data[i];
}

As with the Independent Loop this kind of loop can be distributed. Care obviously needs to be made to ensure the various versions of the externally updated variables are ultimately reunited.

Dependent Loops

The last type of loop I'd like to introduce is the Dependent Loop. This is one where the result of one iteration is dependent on the last, or the outcome of the loop is dependent on the order of execution. The dependency may be data, or control related. A data dependent example is:

int data[] = {5, 4, 1, 6, 3};
for (int i = 1; i++; i < data.length)
{
        data[i] += data[i-1];
}

One such control related dependency may even be the decision whether to execute the next iteration. This is common in while loops, for example the conventional algorithm for deriving the greatest common devisor of two numbers u and v:

while (v != 0)
{
        temp = u % v;
        u = v;
        v = temp;
}

When the result of the next iteration of the loop relies upon the former we cannot distribute the execution. Instead, if possible, the code should be rewritten to be Independent or Accumulative.
An interesting observation is that the boundary case of infinite looping is a class of Dependent Loop. This is because the Independent and Accumulative Loops iterate over a predefined dataset. The infinite loop is a particularly dangerous boundary case in distributed programming as if released undetected it could literally consume all of the resources. Most applications will have an infinite loop somewhere, generally as a message processing construct (equivalent to the while(true) { } in your Java Runnable) - but these are implictly dependent because generally these loops will be consuming some kind of resource (i.e. a message queue or similar).

Of interest to me was that all three types of algorithms can be written using 'for' loops (or even 'while' loops), and hence the construct is actually too flexible to be useful in distributed programming. If for example we have:

for (...) { blah.doSomething() }

It is very difficult to determine whether this loop is Dependent, Accumulative or Independent. This may be even more difficult in a dynamic functional language such as Ruby or Lisp as you might be using function pointers that are unknown at compile time. [4]

Instead in our development we have introduced new constructs in our distributed language that allow the programmer and the computer to cleanly separate these types. The programmer must specify their intention at design time and this significantly simplifies the compiler's task of determining which loops will be distributed and which will not.

doug@jodoro.com
...

[1] The Do loop was first introduced in FORTRAN 66 in 1966. See: http://en.wikipedia.org/wiki/For_loop#Timeline_of_for_loop_in_various_programming_languages
[2] For details on other loop optimisation techniques take a look at http://en.wikipedia.org/wiki/Loop_transformation.
[3] See http://news.cnet.com/8301-13924_3-9981760-64.html
[4] This same issue is referenced in Jon’s Map-Reduce and Parallelism article.

Defining Cloud Computing

2008-07-20T22:27:00.009+10:00

Generally, when a new fad [1] comes along in the software industry there is a fantastic initial rush of enthusiasm, buoyed by the initial success and enthusiasm of the innovators...

...This is generally followed by a series of attempts to define just what the fad is. We've seen this in recent history with topics such as Service Orientated Architecture (SOA) or Web 2.0 - or even things that seem very specific at the outset, like AJAX. There are countless other examples going back further.

This is now the case with the term Cloud Computing. Cloud Computing is difficult to define, but it's very generally about the commoditisation of computing resource. Instead of buying a server, with a specific processing and storage capacity - you simply buy an allocation of capacity from a vendor, a capacity that you can then increase or decrease as necessary. Instead of having a number of discrete resources (e.g. Ten disks on ten different servers), you get access to storage as one, virtual service.

It seems a simple proposition from many perspectives, but in reality the distinction is pretty subtle. After all, buying a physical server in principle achieves the same result - you pay and get more capacity. However, the real difference is in the definition of the service you get, and the mechanics for accessing that service - which is where the definition gets complicated.

Recently on Slashdot (Multiple Experts Try Defining "Cloud Computing"), there was an reference to a SYS-CON article on the definition of Cloud Computing - Twenty Experts Define Cloud Computing [2].

So Cloud Computing is not unique in that is lacks an explicit boundary. Definitions in the SYS-CON Article [2] varied quite a bit. Some of the commentary included:

..the broad concept of using the internet to allow people to access technology-enabled services
Most computer savvy folks actually have a pretty good idea of what the term "cloud computing" means: outsourced, pay-as-you-go, on-demand, somewhere in the Internet, etc.
Clouds are the new Web 2.0. Nice marketing shine on top of existing technology. Remember back when every company threw some ajax on their site and said "Ta da! We’re a web 2.0 company now!"? Same story, new buzz word..

One of the definitions I particularly liked was by Yan Pritzker [3]. This definition makes three key points (refer to the article for the full explanation):

Clouds are vast resource pools with on-demand resource allocation.
Clouds are virtualized.
Clouds tend to be priced like utilities.

All of these definitions are generally from respected pundits and enthusiasts in the area. In the formative (storming) stage of a technology these are often very insightful and useful. However, as the audience and participants grow, this is diluted very quickly. In my experience, the definition of a fad is (eventually) corrupted by these definitions - In particular, as software and hardware vendors inevitably take control of the spotlight.

It's very common in my experience to see vendors come in with the following line - (a) here is this great fad and how it's changing the world, (b) it's obviously complex, so here is a definition of it that you can digest - and (c) here is our our products... Coincidentally, our products fit that definition exactly.

This is perhaps cynical. It's perfectly natural for a vendor to define their strategy in line with industry thinking and then produce their product suite to fit. Arguably, IBM know quite a bit about Cloud Computing. IBM has been pushing the "On Demand" brand across their various offerings since 2002 [4]. This is not to say that IBM's On Demand necessarily fits some of the definitions of Cloud Computing [5] - but it's certain that IBM will fold the buzz Cloud Computing into it's strategy around On Demand [6].

However, the cynicism around the action of vendors in this space is not unfounded. There are a lot of products that suddenly become SOA or BPM or AJAX once the term becomes a hot property in the industry. Or you simply take an existing product suite, add another product or extension that fits the mould, and rebrand the lot. This also highlights one good strategy for a startup - Find an existing, established vendor, spot some kind of technology gap or inconsistency, develop against the gap and get acquired.

There is an underlying reason that this definition exercise is possible. To an outside party it may be clear that a new approach or technology has potential. This is driven by the outstanding success of the innovators and early adopters. This success often generates a justified amount of buzz in the industry. So there is potential, and a new way of doing things - and generally organisations have problems to solve. So the question inevitably arises - "This sounds good, but what can it do for me?".

Whilst this generally results in a bewildering number of answers, usually designed to push product, the fact is that often the cornerstones underlying these fads are relatively straightforward. SOA really has it's roots in Web Services - and although the two are theoretically independent, they have walked hand in hand through the hype cycle. Every presentation you will see will belie this fact, in fact they will often state the exact opposite "SOA is not Web Services" [7]. This is somewhat true, but this often leads to a definition that is a tautology.

These snippets usually get entangled for good reason. The fact is, just adopting Web Services (or whatever equivalent) simply isn't the end of the story. In fact, it's often just the first step in thousands. You have Services now, so you need to lifecycle manage them, instrument them, align them to Business Processes to get "true business benefit and reusabily", etc, etc. Most of these aren't new problems at all, but the componentisation around the original Service approach (the Web Service) has changed the unit of work, and changed how a lot of moving parts are now expected to interconnect - software, hardware, processes and people. So something like SOA moves in to fill these gaps, which clearly can be nebulous in the extreme.

So what you have is a shift in technology that quickly becomes a bigger beast. Now that you have this technology, you need the consummate shifts in the methods, tools and outlooks that support this technology. Sometimes these shifts are on the periphery, sometimes they are orders of magnitude greater than the original technology driver. This is where the definitions become nebulous. e.g. "Now that I'm building Services, which ones should I build and how do I maintain them?"

In that sense, Cloud Computing falls into a similar trap. Accepting the fundamental technology premise around Cloud Computing opens up a lot of questions. Cloud Computing may be quite simple technically, but there are a whole raft of other implications that you need to deal with - many of which are old ground. I'm sure Mainframe pundits look at concepts such as SOA and Cloud Computing and say "but we've already solved that". In fact many cynics I speak to in the industry label these initiatives as the industry's desire to get back to the Mainframe - arguable, but you can usually see their point. CORBA pundits are probably part of a similar chorus.

Irrespective, something like Cloud Computing puts these combinations together in unique ways. So whilst you're treading old ground, some of the solutions might not necessarily be the same. For example, one of the keys in Cloud Computing could be achieving a cost basis that drives economies of scale.

So, to contradict myself and provide a definition - here are some of the technology keys that I see as driving Cloud Computing:

Even thought it might be transparent to an end user, Cloud Computing is run on "Commodity Hardware". This is a loose definition in itself, but generally means off-the-shelf x86 servers, often running Linux.
Further to this, instead of a handful of Big Iron servers, Cloud Computing tends to rely on a large cluster of these commodity machines.
Cloud Computing relies on increases in bandwidth to allow access to processing and storage resources via the Internet.

What this generally means is that:

Your computing services are cheaper because they are using cheaper hardware.
Your computing services are cheaper because utilisation is higher - if you're not using your capacity, someone else probably is.
It's easier to scale, as you are tapping the bigger resource pool that a single vendor can offer [8].
The Quality of Service (e.g. Reliability) is tuned by the inherent capacity and redundancy of the cloud. The systems aren't specified to be ultra-reliable, the resilience of a cloud is driven by having redundant nodes to take over the in the result of failure.
You were reliant on external parties before (e.g. ISP, Web Host), but you're increasingly reliant on them "higher up the stack".
Cloud Computing relies less on references that refer to physical assets (e.g. a fileshare on a specific server), but instead virtualises these for the customer. You just need a reference to the file, and a place to ask for it - you don't care where it actually resides.

Whilst most of these technology shifts are all about "Business Benefit" of some sort, it's often useful to take a concept back to it's absolute fundamentals to put it in perspective. Much of the surrounding definition often points to a philosophy or method - underlying all of this is often a series of technical shifts that is prompting this.

Going back to the technology definition can obscure some of the more conceptual benefits, but it also gives perspective - many of the pitfalls with a new technology have already been encountered before. The SYS-CON [2] links to a Paul Wallis' definition [9] highlights this fact further.

...And perhaps most of all, the key technologies are simply the easiest part to nail down.

jon@jodoro.com
...

[1] Fad is perhaps a bit of an unfair term - A fad could actually be a very significant event.
[2] Twenty Experts Define Cloud Computing, http://cloudcomputing.sys-con.com/read/612375_p.htm
[3] Yan Pritzker, Defining Cloud Computing, http://virtualization.sys-con.com/read/595685.htm
[4] J. Hoskins, B Woolf, IBM on Demand Technology Made Simple, 3rd Ed, Clear Horizon
[5] IBM's On Demand Strategy is hard to pin down specifically, it is a horizontal branding applied across a lot of the IBM silos. A search on the ibm.com site returns over 100,000 hits alone. Also see Google Search: http://www.google.com.au/search?q=ibm+%22on+demand%22
[6] M. Darcy, IBM Opens Africa's First "Cloud Computing" Center, Second Cloud Center in China, http://www-03.ibm.com/press/us/en/pressrelease/24508.wss
[7] To an extent this is absolutely true, but it isn't really true in practice. Standards under most SOA definitions are key, and Web Services are absolutely the cornerstone of this.
[8] However, just because your infrastructure scales, this doesn't mean your specific solution on that infrastructure does.
[9] P. Wallis, "A Brief History of Cloud Computing: Is the Cloud There Yet?", http://cloudcomputing.sys-con.com/read/581838.htm.

Map Reduce and Parallelism

2008-07-13T13:47:00.003+10:00

Map-Reduce is a powerful mechanism for parallel execution. Originally a set of semantics used in functional languages, map-reduce is now heavily employed in processing information in a variety of clustered environments. It's used by Google a lot, and Google has their own framework for handling map-reduce [1].

As would be evident, map-reduce is made up of two key concepts:

1. Map takes a list of data and applies an operation or transformation to each element of the list to produce another list of results. The general implication here is the results list is of the same magnitude as the source list. For example if you had a list of 1,000 numbers, after the map you'd have another list of 1,000 elements (be they numbers or not).
See: http://en.wikipedia.org/wiki/Map_(higher-order_function)

2. Reduce takes the list of results and compiles them in some fashion. Unlike Map, there is some kind of expectation of "reduction" or derivation of the data - that is if you had a list of 1,000, the result might be a list of 100, or a single number [2]. So a reduce could be anything from summing all the elements, to sorting them, or cutting them back to the top 100 - or any variant therein.
See: http://en.wikipedia.org/wiki/Fold_(higher-order_function)

Google list a number of advantages to the use of their map-reduce framework [1]:
- Automatic parallelisation and distribution
- Fault-tolerance
- I/O scheduling
- Status and monitoring

Most of these are operational benefits. However, the real benefit to Google in using map-reduce is within the first statement, the automatic parallelisation and distribution of processing. This is particularly important when processing very large data sets and/or responding to a user request - a user will only wait so long for a response. A user isn't going to wait ten minutes for a web search to return. So Google run the same search spread across thousands of machines, giving you a response in seconds (in reality, sub-second).

A fairly accessible example of a map-reduce operation is this kind of search. In this case, Map would take a function, such as "Score" and apply it to a large list of data - a large list of webpages. This score is representative of how well the webpage matches some criteria, such as search terms.

Reduce takes one or more list of scores and performs a "fold" or reduce. In this example, it would do something like take the list of scores and cut it back to the top 100, sorted from highest to lowest. This example of a reduce always produces 100 or less scores - give it 50 and it will produce 50 results (sorted), give it 1,000 and it will produce a list of the top 100 results (again, sorted).

For example, if I'm searching for "Jodoro", the Map looks at the 1,000 pages available and using the scoring operation it gives each page a score for the occurrence of "Jodoro" on the page. Reduce then looks at these 1,000 scores and whittles it back to the top 100. The pseudocode for this might look like:


Define MapReduce(Pages, Score, Reduce)
   Scores = Map(Pages)
   Result = Reduce(Scores)
   Return Result
End

Define Map(Pages, Score)
   Create a new list Scores
   For each Page in Pages
       PageScore = Score(Page)
       Add PageScore to Scores
   Next
End

This is all relatively pedestrian, but the power of the a Map-Reduce is that these can be performed in parallel. To illustrate this, here is a high level piece of pseudo-code:


Define MapReduce(Pages, Score, Reduce)
   Split Pages into 4 Lists, producing FirstPages, SecondPages, ThirdPages, FourthPages
 
   FirstScores = Map(FirstPages)
   SecondScores = Map(SecondPages)
   ThirdScores = Map(ThirdPages)
   FourthScores = Map(FourthPages)
 
   FinalScore = Reduce(FirstScores, SecondScores, ThirdScores, FourthScores)

   Return FinalScores
End

This map-reduce operation splits the scoring process into four pieces that are individually mapped, then ranked using the reduce function. So if you had 400 pages, you would end up with four lists of scores each 100 long (FirstScores through FourthScores). The Reduce function takes these four 100 scores and produces a list of the top 100.

It's probably worth pointing out at this stage that this is a complete search. We've not used heuristics, we've examined every page and determined the "best" result. We could have cheated. For example if a page scored very low, we might not have bothered to include it in the scored list [3]. Whilst these heuristics are useful, they are not guaranteed. However, in specific cases such as search, it's probably appropriate to discard very low results.

In the case of this example, another variant might be:


   ...
   FirstScores = Score(FirstPages)
   SecondScores = Score(SecondPages)
   Scores1 = Score(FirstScores, SecondScores)

   ThirdScores = Score(ThirdPages)
   FourthScores = Score(FourthPages)
   Scores2 = Score(ThirdScores, FourthScores)

   FinalScore = Reduce(Scores1, Scores2)
   ...

In this case, we Reduce three times. First we take the first two results and combine them, then the second two - and finally we take the two aggregates to produce a final result.

So what does this all mean? Well, the key is that the mapping piece can occur entirely in parallel - or to be more correct, the scoring can occur in parallel. All of the four "Score" pieces above could occur independently. So you could run the four Score pieces on four different systems in order to process them faster. Google MapReduce or the open source Hadoop framework [4] take this parallelism a step further. Instead of breaking the Pages into four (or whatever) arbitrary pieces, they create a large pool of servers - each with their own set of Pages [5]. When you invoke a map-reduce, a Master server asks all of the servers to undertake the individual map on their particular data, producing the score or otherwise derived result. These scores are then "reduced" to the actual list of results [6]. So the split in the example pseudocode is actually achieved by partitioning the data across the multiple systems.

All this seems pretty straightforward and useful. However, there is a sting in the tail. What isn't made clear by this explanation is a key underlying premise - That the individual score operations are independent. If your score operation was somehow dependent or interlocked with another score operation, then clearly they cannot operate in parallel. This almost certainly not the case when matching a set of search terms against a webpage - this is exemplified by the search example. A score can occur on a webpage on one system completely independently of the scoring that is occurring on another physically distinct system.

It's probably also worth noting that the same could be said for Reduce - in our examples we showed Reduce being used either once or three times. The presumption is that the Reduce doesn't modify the underlying scores and that there are no other side-effects. If it did, the multiple uses of Reduce may also produce an unexpected result.

Joel Spolsky validly points out that an understanding of functional languages assists with understanding map-reduce [7] as map-reduce fundamentally came from semantics in languages such as Lisp. This is largely true, but the semantics of map-reduce do not necessarily imply parallel execution. Certainly the first functional implementations of map and reduce weren't intended for parallel operation at all.

So adding the right semantics into a language to give you map-reduce doesn't necessarily make the code parallel, whilst it does open up possibilities. In our search example, the structure of the data (Pages) and the independent operation of the mapping operation (Score) is absolutely fundamental to having the map-reduce operate in parallel. The use of map-reduce primarily provides a metaphor where these work together in a logical, consistent way.

If you had a piece of Java code do your scoring, then it's totally feasible that you don't know anything about the scoring implementation (as with a functional language, that's part of the point). However, you'd need to be careful that this piece of code isn't dependent on some kind of resource - it could be manipulating private variables, writing to disk, generating random numbers, or (god forbid) relying on some kind of Singleton instance. For example, your Score may initially generate a random number that is used in all the subsequent scoring operations. If you spread this across four machines you might have the unexpected side-effect that they are operating using a different random number. Whatever the technicalities, in some circumstances it may be very difficult to establish how independent an arbitrary Score implementation is. This is somewhat of an irony - as this kind of abstraction (i.e. an arbitrary score function) is key to a functional language.

As another example, something like the patented Google PageRank [8], relies on the linkages between pages in order to score their importance. What is important here is that the score of a page is derived from the importance of the pages that link to it. In this case, you'd need to be careful about how you built your Score function (to Rank the pages) as the score of an individual page is dependent on the scores of others [9].

So - it isn't the map-reduce that is operating in parallel. It's the means of definition of the source data and the transformation that is the first enabler to parallel operation. You can't necessarily take map-reduce and apply it wholesale to do an existing complex financial risk analysis, for example. There are likely complex algorithmic dependencies that simply cannot be accommodated - certainly at least without recasting the problem or data in a manner that suits the model. A solution to problem might actually require a change in data definition, or might require a the design several map-reduce operations to operate. Or simply, the problem might have so many implicit algorithmic dependencies that map-reduce isn't appropriate at all.

Fundamentally whilst map-reduce is useful at increasing the level of parallelism in your code, it doesn't intrinsically give you parallel processing. What it does give is a common metaphor for this type of processing, enabling the development of a framework - and all the benefits a framework brings. If you need to do parallel processing, you still need to focus on the composition of the problem.

jon@jodoro.com
...

[1] Jeffrey Dean and Sanjay Ghemawat, http://labs.google.com/papers/mapreduce.html, December 2004
[2] For something like a sort it could return the same number of elements. I can't think of a valid example where a Reduce would *increase* the number of elements. So reduce is probably a pretty apt term.
[3] Google would almost certainly do this with a Web search.
[4] See http://hadoop.apache.org/
[5] Reality is that they will have a lot of redundant nodes in order to scale, but this is the basic premise.
[6] In the case of a Google Web Search, the reduce is to reduce the set of results to the top 1,000 results. You never get more than 1,000: For example, http://www.google.com/search?q=the&start=990
[7] Joel Spolsky 'Can Your Programming Language Do This?', http://www.joelonsoftware.com/items/2006/08/01.html, August 2006
[8] See http://www.google.com/corporate/tech.html
[9] Clearly Google has approaches to solve this dependency, but it is useful in illustrating the point.