Dilbert schematics

How can we package, manage, mix and merge graph datasets that come from different contexts, without getting our data into a terrible mess?

During the last W3C RDF Working Group meeting, we were discussing approaches to packaging up ‘graphs’ of data into useful chunks that can be organized and combined. A related question, one always lurking in the background, was also discussed: how do we deal with data that goes out of date? Sometimes it is better to talk about events rather than changeable characteristics of something. So you might know my date of birth, and that is useful forever; with a bit of math and knowledge of today’s date, you can figure out my current age, whenever needed. So ‘date of birth’ on this measure has an attractive characteristic that isn’t shared by ‘age in years’.
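The "bit of math" can be sketched in a few lines of Python. This is only an illustration: the function name `age_in_years` and the sample date of birth are made up for the example, not taken from any vocabulary.

```python
from datetime import date

def age_in_years(date_of_birth: date, today: date) -> int:
    """Return the number of completed years between date_of_birth and today."""
    years = today.year - date_of_birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

# Hypothetical date of birth, used only for illustration.
dob = date(1970, 6, 15)
print(age_in_years(dob, date(2011, 3, 20)))  # 40
print(age_in_years(dob, date(2011, 6, 15)))  # 41
```

The stored fact (the date of birth) never changes; only the derived value does, and it is recomputed on demand rather than asserted.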

At any point in time, I have at most one ‘age in years’ property; however, you can take two descriptions of me that were at some time true, and merge them to form a messy, self-contradictory description. With this in mind, how far should we be advocating that people model using time-invariant idioms, versus working on better packaging for our data so it is clearer when it was supposed to be true, or which parts might be more volatile?

The following scenario was posted to the RDF group as a way of exploring these tradeoffs. I repeat it here almost unaltered. I often say that RDF describes a simplified – and sometimes over-simplified – cartoon universe. So why not describe a real cartoon universe? Pat Hayes posted an interesting proposal that explores an approach to these problems; since he cited this scenario, I wrote it up as a blog post.

Describing Dilbert: theory and practice

Consider an RDF vocabulary for describing office assignments in the cartoon universe inhabited by Dilbert. Beyond the name, the examples here aren’t tightly linked to the Dilbert cartoon. First I describe the universe, then some ways in which we might summarise what’s going on using RDF graph descriptions. I would love to get a sense for any ‘best practice’ claims here. Personally I see no single best way to deal with this, only different and annoying tradeoffs.

So, this is a fictional, highly simplified company in which each worker is assigned to exactly one cubicle, and in which every cubicle has at most one assigned worker. Cubicles may also sometimes be empty.

  • Every 3 months, the Pointy-haired boss has a strategic re-organization, and re-assigns workers to cubicles.
  • He does this in a memo dictated to Dogbert, who will take the boss’s vague and forgetful instructions and compare them to an Excel spreadsheet. This, cleaned up, eventually becomes an emailed Word .doc sent to the all-staff@ mailing list.
  • The Word document is basically a table of room moves. It is headed with a date and, in bold type, “EFFECTIVE IMMEDIATELY”; it is usually mailed out mid-evening and read by staff the next morning.
  • In practice, employees move their stuff to the new cubicles over the course of a few days; longer if they’re on holiday or off sick. Phone numbers are fixed later, hopefully. As are name badges etc.
  • But generally the move takes place the day after the Word file is circulated, and at any one point, a given cubicle can be fairly said to have at most one official occupant worker.

So let’s try to model this in RDF/RDFS/OWL.

First, we can talk about the employees. Let’s make a class, ‘Employee’.

In the company systems, each employee has an ID, which is ‘e-’ plus an integer. Once assigned, these are never re-assigned, even if the employee leaves or dies.

We also need to talk about the office space units, the cubes or ‘Cubicles’. Let’s forget for now that the furniture is movable, and treat each Cubicle as if it lasts forever. Maybe they are even somehow symbolic cubicle names, and the furniture that embodies them can be moved around to different office locations. But we don’t try modelling that for now.

In the company systems, each cubicle has an ID, which is ‘c-’ plus an integer. Once assigned, these are never re-assigned, even if the cubicle becomes in any sense de-activated.

Let’s represent these as IRIs. Three employees, three cubicles.

  • http://example.com/e-1
  • http://example.com/e-2
  • http://example.com/e-3
  • http://example.com/c-1000
  • http://example.com/c-1001
  • http://example.com/c-1002

We can describe the names of employees. Cubicles also have informal names. Let’s say that neither changes, ever.

  • e-1 name ‘Alice’
  • e-2 name ‘Bob’
  • e-3 name ‘Charlie’
  • c-1000 name ‘The Einstein Suite’
  • c-1001 name ‘The doghouse’
  • c-1002 name ‘Helpdesk’

Describing these in RDF is pretty straightforward.
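As a rough sketch, here are the name statements as subject/predicate/object triples in plain Python, printed in an N-Triples-like form. The property IRI `http://example.com/name` is an assumption made for this example; the scenario does not say which naming property is used.

```python
# A triple is (subject, predicate, object); IRIs are plain strings here.
EX = "http://example.com/"
NAME = EX + "name"  # hypothetical property IRI, assumed for illustration

names = {
    "e-1": "Alice",
    "e-2": "Bob",
    "e-3": "Charlie",
    "c-1000": "The Einstein Suite",
    "c-1001": "The doghouse",
    "c-1002": "Helpdesk",
}

name_triples = {(EX + local_id, NAME, label) for local_id, label in names.items()}

def to_ntriples(triples):
    """Serialise triples with plain string literals, one statement per line."""
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in sorted(triples))

print(to_ntriples(name_triples))
```

Because the names are assumed never to change, merging any two graphs that contain these triples is harmless: the union says nothing new.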

Let’s now describe room assignments.

At the beginning of 2011 Alice (e-1) is in c-1000; Bob (e-2) is in c-1001; Charlie (e-3) is in c-1002. How can we represent this in RDF?

We define an RDF/RDFS/OWL relationship type, aka property, called eg:hasCubicle.

Let’s say our corporate ontologist comes up with this schematic description of cubicle assignments:

  • eg:hasCubicle has a domain of eg:Employee, a range of eg:Cubicle. It is an owl:FunctionalProperty, because any Employee has at most one Cubicle related via hasCubicle.
  • It is also an owl:InverseFunctionalProperty, because any Cubicle is the value of hasCubicle for no more than one Employee.

So… at the beginning of 2011 it would be truthy to assert these RDF claims:

  • e-1 hasCubicle c-1000
  • e-2 hasCubicle c-1001
  • e-3 hasCubicle c-1002

Now, come March 10th, everyone at the company receives an all-staff email from Dogbert, with cubicle reassignments. Amongst other changes, Alice and Bob are swapping cubicles, and Charlie stays in c-1002.

Within a week or so (let’s say by March 20th to be sure) the cubicle moves are all made real, in terms of where people are supposed to be based, where they are, and where their stuff and phone line routings are.

The fictional world by March 20th 2011 is now truthily described by the following claims:

  • e-1 hasCubicle c-1001
  • e-2 hasCubicle c-1000
  • e-3 hasCubicle c-1002
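To see why merging the two snapshots is a problem, here is a small sketch that checks the "at most one" rules over sets of triples. It is not an OWL reasoner: it simply assumes that different identifiers name different things. An actual OWL reasoner makes no such unique-name assumption, and could instead conclude that c-1000 and c-1001 are the same cubicle unless they are declared different. The helper name `violations` is invented for this example.

```python
from collections import defaultdict

HAS_CUBICLE = "eg:hasCubicle"

january = {
    ("e-1", HAS_CUBICLE, "c-1000"),
    ("e-2", HAS_CUBICLE, "c-1001"),
    ("e-3", HAS_CUBICLE, "c-1002"),
}
march = {
    ("e-1", HAS_CUBICLE, "c-1001"),
    ("e-2", HAS_CUBICLE, "c-1000"),
    ("e-3", HAS_CUBICLE, "c-1002"),
}

def violations(graph, prop):
    """Return (subjects with several values, objects with several subjects) for prop."""
    by_subject = defaultdict(set)
    by_object = defaultdict(set)
    for s, p, o in graph:
        if p == prop:
            by_subject[s].add(o)
            by_object[o].add(s)
    too_many_values = sorted(s for s, objs in by_subject.items() if len(objs) > 1)
    too_many_subjects = sorted(o for o, subs in by_object.items() if len(subs) > 1)
    return too_many_values, too_many_subjects

print(violations(january, HAS_CUBICLE))          # ([], [])
print(violations(march, HAS_CUBICLE))            # ([], [])
print(violations(january | march, HAS_CUBICLE))  # (['e-1', 'e-2'], ['c-1000', 'c-1001'])
```

Each snapshot is fine on its own; only the union of the two breaks the rules, and nothing in the triples themselves says which snapshot is the current one.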

Questions / view from Named Graphs.

1. Was it a mistake, bad modelling style etc, to describe things with ‘hasCubicle’? Should we have instead described a date-stamped ‘CubicleAssignmentEvent’ that mentions for example the roles of Dogbert, Alice, and some Cubicle? Is there a ‘better’ way to describe things? Is this an acceptable way to describe things?
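For comparison, here is a sketch of what the event-based alternative might look like as plain data. The field names, the `cubicle_on` helper, and the exact effective dates (1 January for the initial state, 10 March for the memo) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class CubicleAssignmentEvent:
    effective: date          # date on the memo (assumed, see above)
    employee: str
    cubicle: str
    assigned_by: str = "Dogbert"

events = [
    CubicleAssignmentEvent(date(2011, 1, 1), "e-1", "c-1000"),
    CubicleAssignmentEvent(date(2011, 1, 1), "e-2", "c-1001"),
    CubicleAssignmentEvent(date(2011, 1, 1), "e-3", "c-1002"),
    CubicleAssignmentEvent(date(2011, 3, 10), "e-1", "c-1001"),
    CubicleAssignmentEvent(date(2011, 3, 10), "e-2", "c-1000"),
]

def cubicle_on(employee: str, day: date) -> Optional[str]:
    """Return the cubicle from the latest event for employee effective on or before day."""
    relevant = [e for e in events if e.employee == employee and e.effective <= day]
    if not relevant:
        return None
    return max(relevant, key=lambda e: e.effective).cubicle

print(cubicle_on("e-1", date(2011, 2, 1)))   # c-1000
print(cubicle_on("e-1", date(2011, 3, 20)))  # c-1001
```

Event descriptions stay true after later events are added, so merging them is safe. The cost is that "where is Alice now?" becomes a query over the log rather than a single triple.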

2. How should we then express the notion that each employee has at most one cubicle and vice versa? Is this appropriate material to try to capture in OWL?

3. How should a SPARQL store or TriG++ document capture the different graphs describing the evolving state of the company’s office-space allocations?
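One possible shape for an answer, sketched in Python rather than in any particular store or syntax: keep each snapshot in its own named graph, and keep the dates as metadata about the graphs rather than about the triples. The graph names and the `valid_from` table are hypothetical.

```python
from datetime import date

HAS_CUBICLE = "eg:hasCubicle"

# Each quad is (subject, predicate, object, graph name).
quads = {
    ("e-1", HAS_CUBICLE, "c-1000", "g-2011-01"),
    ("e-2", HAS_CUBICLE, "c-1001", "g-2011-01"),
    ("e-3", HAS_CUBICLE, "c-1002", "g-2011-01"),
    ("e-1", HAS_CUBICLE, "c-1001", "g-2011-03"),
    ("e-2", HAS_CUBICLE, "c-1000", "g-2011-03"),
    ("e-3", HAS_CUBICLE, "c-1002", "g-2011-03"),
}

# Metadata about the graphs themselves (hypothetical start-of-validity dates).
valid_from = {"g-2011-01": date(2011, 1, 1), "g-2011-03": date(2011, 3, 20)}

def graph_for(day):
    """Pick the most recent graph whose validity starts on or before day."""
    candidates = [(start, name) for name, start in valid_from.items() if start <= day]
    return max(candidates)[1] if candidates else None

def triples_on(day):
    """Return the triples of the snapshot graph that applies on the given day."""
    name = graph_for(day)
    return {(s, p, o) for s, p, o, g in quads if g == name}

print(graph_for(date(2011, 2, 1)))  # g-2011-01
print(("e-1", HAS_CUBICLE, "c-1001") in triples_on(date(2011, 4, 1)))  # True
```

The triples keep the simple hasCubicle style, and the "at most one" rules hold within each graph; a consumer only gets into trouble if it ignores the fourth column and merges everything.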

4. Can we offer any practical but machine-readable metadata that helps indicate to consuming applications the potential problems that might come from merging different graphs that use this modelling style? For example, can we write any useful definition for a class of property “TimeVolatileProperty” that could help people understand the risk of merging different RDF graphs using ‘hasCubicle’?
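As a thought experiment, a consuming application could use such a class like this. Everything here is hypothetical: `TIME_VOLATILE` stands in for "the set of properties typed as TimeVolatileProperty", and `merge_warnings` is an invented helper, not part of any RDF toolkit.

```python
# Properties we assume the schema has flagged as time-volatile.
TIME_VOLATILE = {"eg:hasCubicle"}

january = {
    ("e-1", "eg:hasCubicle", "c-1000"),
    ("e-2", "eg:hasCubicle", "c-1001"),
    ("e-3", "eg:hasCubicle", "c-1002"),
}
march = {
    ("e-1", "eg:hasCubicle", "c-1001"),
    ("e-2", "eg:hasCubicle", "c-1000"),
    ("e-3", "eg:hasCubicle", "c-1002"),
}
names = {("e-1", "eg:name", "Alice"), ("e-2", "eg:name", "Bob")}

def merge_warnings(graph_a, graph_b, volatile=TIME_VOLATILE):
    """List volatile properties that both graphs use with differing statements."""
    warnings = []
    for prop in sorted(volatile):
        a = {(s, o) for s, p, o in graph_a if p == prop}
        b = {(s, o) for s, p, o in graph_b if p == prop}
        if a and b and a != b:
            warnings.append(prop)
    return warnings

print(merge_warnings(january, march))  # ['eg:hasCubicle']
print(merge_warnings(names, january))  # []
```

A flag like this cannot say which graph is right; it only tells the consumer that the two graphs may describe different moments and should not be blindly unioned.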

5. Can the ‘snapshot of the world-as-it-now-is’ view and the ‘transaction / event log view’ be equal citizens, stored in the same RDF store, and can metadata / manifest / table of contents info for that store be used to make the information usefully exploitable and reasonably truthy?

8 Responses to Dilbert schematics

  1. This is a classical problem when you try to model n-ary relations using a binary relation-based language. Assuming RDF won’t change to use n-tuples instead of triples, I would go for creating blank nodes of type CubicleAssignmentEvent or any other name you want to give to it, which will be linked to the employee’s instances. These events should be linked to the cubicle URI as well as a date entity (using time ontology, or simply a date^^xsd:date). Of course you can use OWL to force to have no more or less than one event date and no more or less than one cubicle per event, and so on…

    The REAL problem is that you do all this only when you see the problem ahead, which may be even years after you did the original modeling. This implies that you missed data, you will probably have to modify code, etc…

    On the other hand, even if you define this consideration of temporal dimension by default as a ‘good practice’ (which is not really clear to me it is), it probably won’t be widely adopted, for being considered cumbersome and an overkill for lots of cases…

    My USD 0.02
    Alvaro Graves

  2. danbri says:

    Yes, exactly – the problems come from being constrained to binary relations.

    Check out Pat Hayes’s notes on using the 4th place: http://lists.w3.org/Archives/Public/public-rdf-wg/2011Nov/0019.html

  3. nic gould says:

    There are two questions we should easily be able to answer from the data model: who is occupying a cubicle right now, and who was occupying it on date x.

    If we model this in a single graph, using a basic hasCurrentCubicle predicate to record the cubicle of an employee and OWL to restrict this to one current unique cubicle per employee, this easily answers the first question.

    In addition if we record a cubicle occupancy history using blank nodes of type cubicle occupancy and recording the start and end dates for the occupancy of a cubicle by any employee then this enables the second question to be answered (albeit with a messier query).

    Also, rather than creating the slightly abstract cubicle occupancy type we could create the memo as a resource with a date and record the cubicle moves as properties of the memo.

    I’d favour the above over any solution using named graphs.

  6. patrickdlogan says:

    I would be interested in seeing evaluations of various solutions in RDF per se before I would feel comfortable going to the “fourth place”. At this point it seems there is a lot more experience modeling in RDF than modeling in “RDF + 1”. The standards won’t be able to repeal “RDF + 1” very easily so they should move with caution in that direction.

    In this scenario, it seems the aspect of interest is “change over time” of some relationships. And so I believe that “change over time” should be modeled in this scenario as first-class triples. Because of this I would even shy away from blank nodes; these “change over time” relationships seem to me to be worthy of URLs.

    I have another concern about basing such important discussions on a “make believe” scenario. Better to determine real scenarios that have some measurable cost/benefit. The scenario is fine to begin a discussion, but until a real application has something to win or lose, we’ll not have a full measure of the impact.

  7. I was thinking about this yesterday.

    I’m a semantic web n00b, so forgive me if this is stupid, but couldn’t we deal with this issue by just reifying each of the “hasCubicle” statements and then adding a meta-statement about each along the lines of “trueForTimeRange”?

    So we would have

    (:alice :hasCubicle :c1) :trueForTimeRange :beforeMarch20

    then

    (:alice :hasCubicle :c1) :trueForTimeRange :afterMarch20

    Then we could add a fairly simple functionality to SPARQL to “timeline” reconcile overlapping statements.

    I see that as adding an additional burden to the producers of the RDF, which would fight against adoption. But it could also be done retrospectively to correct for not planning ahead, so it would make it fairly easy to adopt when it became necessary.

    What do you think?
