Sunday, September 7, 2008

Genealogy the Ontological Problem

Genealogy is a contrast. It is a very personal experience and endeavour. The majority of people are interested in their own family, their own ties and their own history. Ultimately, however, this history is shared. The further back individuals investigate into their ancestry, the broader the common ground becomes.

We are also interested in Genealogy - because it's both a fascinating space, and one that exercises a lot of interesting problems.

There have been a number of sites dedicated to Genealogy for some time, but the majority of these are forums in which people collaborate (e.g. "Anyone know of a John Browne in Manchester, England, sometime around 1855"). There are also emerging sites that let you build family trees, but these are generally private trees, or limited to collaboration in family groups. TechCrunch reported that a "war" is developing in the genealogy space [1].

Speaking very generally, a family tree is a nodal representation of relationships between individuals. The emphasis, naturally, is on genetic ties. The key relationships are Marriage (or more correctly, parents) and parentage. These ties link the individuals in the family tree.
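To make the "nodal representation" concrete, here is a minimal sketch in Python. The class name FamilyTree, its methods and the sample names are all our own illustrative assumptions, not a description of any existing genealogy site. It stores only the two base relationships (parentage and partnership) and derives everything else from them.

```python
from collections import defaultdict


class FamilyTree:
    """Hypothetical minimal family tree: people are plain string identifiers."""

    def __init__(self):
        self.parents = defaultdict(set)   # child -> set of recorded parents
        self.partners = defaultdict(set)  # person -> set of recorded partners

    def add_parent(self, child, parent):
        self.parents[child].add(parent)

    def add_partnership(self, a, b):
        # Partnership is symmetric, so record it in both directions.
        self.partners[a].add(b)
        self.partners[b].add(a)

    def siblings(self, person):
        """People who share at least one recorded parent with `person`."""
        mine = self.parents.get(person, set())
        return {
            child
            for child, theirs in self.parents.items()
            if child != person and theirs & mine
        }


tree = FamilyTree()
tree.add_partnership("Mary", "John")
for child in ("Ann", "Bob"):
    tree.add_parent(child, "Mary")
    tree.add_parent(child, "John")
tree.add_parent("Cat", "John")  # shares only one parent with Ann and Bob

print(tree.siblings("Ann"))
```

Note that siblings are not stored anywhere. They fall out of the parentage links, which is why a half-sibling such as "Cat" shows up without any special handling.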

This is relatively simple so far. However, there are more complex relationships, even with this simple base set of relationships. For example, you can infer "great" relationships (ancestry). Each "great" you add doubles the number of possible ancestors, so this increases exponentially. There are sibling relationships and other more specialised scenarios - such as half-siblings, step-siblings, twins, or adoption. In modern times you now have the possibility of same-sex marriage, surrogate pregnancy or sperm and egg donation. There are also other cases, which could sometimes be skeletons in a family tree - multiple marriages, incest, adultery. You wouldn't need to go far back in many family histories to find someone who vanished (and perhaps had another family unknown to both sides), was disowned or was simply forgotten.
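As a rough illustration of how "great" relationships can be inferred and how quickly they grow, here is a small Python sketch. The function names and the sample data are our own assumptions. It treats parentage as a plain dictionary from child to parents, and the count it gives is an upper bound, since in a real tree the same person can appear in more than one position.

```python
def ancestor_slots(generation):
    """Upper bound on ancestors `generation` steps back (1 = parents)."""
    return 2 ** generation


def relationship_label(generation):
    """Name of the relationship `generation` steps back (1 = parent)."""
    if generation < 1:
        raise ValueError("generation must be at least 1")
    if generation == 1:
        return "parent"
    return "great-" * (generation - 2) + "grandparent"


def ancestors_at(parents, person, generation):
    """Recorded ancestors exactly `generation` steps above `person`."""
    current = {person}
    for _ in range(generation):
        current = {p for child in current for p in parents.get(child, ())}
    return current


# Hypothetical sample data: child -> recorded parents.
parents = {
    "Ann": ["Mum", "Dad"],
    "Mum": ["Gran", "Grandad"],
}

print(relationship_label(4), ancestor_slots(4))
print(ancestors_at(parents, "Ann", 2))
```

Four steps back is already a "great-great-grandparent" with 16 possible slots, and ten steps back gives 1024. The recorded tree is usually far sparser than that, as the sample shows: only two of Ann's four possible grandparents are known.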

These can all be accommodated in most traditional data models. However, the real complexity is that family trees are still personal and can disagree. This may be as simple as a disagreement over some base factual information such as a name (e.g. Doug vs Douglas, or Smith vs Smithe). It is considerably more complex when there are more structural differences, such as disagreements over parentage, or an entire lineage.

This is hard to handle using traditional data models. A lot of approaches take the tack of a "conflict resolution" mode - much like source control. However, this is inadequate. The fact is, a lot of these conflicts will never be resolved. Someone's Aunt may never agree with such-and-such's Uncle. You can simply replicate all the information in each family tree, but you're creating a lot of redundant data and (severely) limiting the utility of this information. This approach simply devalues the power of the information when people do agree.
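One way to picture an alternative is to store each statement once and record who asserts it, rather than resolving conflicts or copying whole trees. The Python sketch below is only an illustration under our own assumptions: the names Claim and ClaimStore, the identifiers like "p1", and the sources "aunt" and "uncle" are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One statement about a person, e.g. ("p1", "surname", "Smith")."""
    subject: str
    attribute: str
    value: str


class ClaimStore:
    """Hypothetical store: each distinct claim is kept once, with its supporters."""

    def __init__(self):
        self._supporters = defaultdict(set)  # Claim -> set of sources

    def assert_claim(self, source, subject, attribute, value):
        self._supporters[Claim(subject, attribute, value)].add(source)

    def values(self, subject, attribute):
        """Every asserted value, mapped to the sources that back it."""
        return {
            claim.value: set(sources)
            for claim, sources in self._supporters.items()
            if claim.subject == subject and claim.attribute == attribute
        }

    def view(self, source, subject, attribute):
        """The values as one particular source sees them."""
        return {
            value
            for value, sources in self.values(subject, attribute).items()
            if source in sources
        }


store = ClaimStore()
store.assert_claim("aunt", "p1", "surname", "Smith")
store.assert_claim("uncle", "p1", "surname", "Smithe")
store.assert_claim("aunt", "p1", "father", "p7")
store.assert_claim("uncle", "p1", "father", "p7")

print(store.values("p1", "surname"))
print(store.values("p1", "father"))
```

Here the aunt and uncle never have to settle the spelling of the surname, yet the father they agree on is stored only once and is visibly backed by both. Each of them can still ask for their own view of the tree.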

To combine this information using a single repository requires a functional and data model that is exceptionally flexible. It seems clear that this is approaching what is often called the "ontology problem" [2]. Ontologies and taxonomies are key to many (all) information domains, and absolutely fundamental to modern information systems.

If you are managing any kind of knowledge, getting the ontology right is pretty important. If you've ever tried to classify or put something into a hierarchy, then it's likely you've hit this complication. Object-Orientated development certainly falls into this space. For example, I have a Van here, and it's classified as Vehicle, but what happens when I have a Motorhome? Or an Amphibious Vehicle? Or an Amphibious Motorhome? If I'm working in a bookstore, do I put a book under fantasy, crime or literature? It might fit in all three.
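The vehicle example can be sketched in Python by letting an item carry several classifications instead of forcing it into one slot in a hierarchy. This is an illustration only: the category names "dwelling" and "boat", and the helper find, are our own assumptions.

```python
# Each item maps to every category it could reasonably be filed under.
catalogue = {
    "van": {"vehicle"},
    "motorhome": {"vehicle", "dwelling"},
    "amphibious vehicle": {"vehicle", "boat"},
    "amphibious motorhome": {"vehicle", "dwelling", "boat"},
}


def find(category):
    """All items filed under `category`, in alphabetical order."""
    return sorted(item for item, cats in catalogue.items() if category in cats)


print(find("dwelling"))
print(find("boat"))
```

This avoids picking a single parent, but it only moves the problem. Someone still has to decide which categories exist and what they mean, and two people building this catalogue independently may well choose differently.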

In these cases, there is no single correct answer. You end up with multiple classifications, all of which are correct. Just like genealogy, it depends on the context. The problem with ontologies is that they can be extremely difficult to define, and like Family Trees, they are complex, recursive, inter-dependent beasts.

When you look at the substance of the Semantic Web, ontologies and taxonomies are absolutely key. You can't semantically link things together unless the ends agree on this key ontological information [3]. It would be impossible to search for "Motorhomes" on the Semantic Web if different sources classify this information in a completely different way. This classification might work in some contexts, but not others. You might end up with islands of information that aren't compatible and cannot be interconnected - the exact opposite of what the Semantic Web is trying to achieve.

This is why we see genealogy as a generalisable problem. Crack some problems in the genealogy space and you might be solving some fundamentals for the Semantic Web - and vice versa.

jon@jodoro.com and doug@jodoro.com


[1] See http://www.techcrunch.com/2008/09/03/genis-quest-toward-one-world-family-tree/ and http://www.techcrunch.com/2008/09/06/family-tree-wars-continue-myheritage-raises-big-round-shows-impressive-growth/.
[2] It might be worthwhile looking at the Google search for this term, http://www.google.com/search?q=ontology+problem.
[3] See http://novaspivack.typepad.com/nova_spivacks_weblog/2004/11/the_ontology_pr.html
