XMPP untethered – serverless messaging in the core?

In the XMPP session at last February’s FOSDEM I gave a brief demo of some NoTube work on how TV-style remote controls might look with XMPP providing their communication link. For the TV part, I showed Boxee, with a tiny Python script exposing some of its localhost HTTP API to the wider network via XMPP. For the client, I had a ‘my first iPhone app‘ approximation of a remote control that speaks a vapourware XMPP remote control protocol, “Buttons”.
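
The TV-side script really was tiny. Here is a rough sketch of its general shape (illustrative only, not the actual demo code): an XMPP bot that listens for button presses and relays them to the media player’s local HTTP API. It assumes the SleekXMPP Python library; the JID, password, command names and localhost endpoint are all placeholders rather than Boxee’s real API.

# Sketch only: an XMPP "remote control" bridge for a local media player.
# Assumes SleekXMPP (Python 2 era); the JID, password, command names and the
# localhost HTTP endpoint are placeholders, not Boxee's real API.
import sleekxmpp
import urllib2

PLAYER_URL = "http://localhost:8800/remote?command=%s"  # hypothetical endpoint

class ButtonsBridge(sleekxmpp.ClientXMPP):
    def __init__(self, jid, password):
        super(ButtonsBridge, self).__init__(jid, password)
        self.add_event_handler("session_start", self.start)
        self.add_event_handler("message", self.on_message)

    def start(self, event):
        self.send_presence()
        self.get_roster()

    def on_message(self, msg):
        if msg['type'] in ('chat', 'normal'):
            command = msg['body'].strip().lower()
            if command in ('pause', 'play', 'stop'):
                # relay the button press to the player's local HTTP API
                urllib2.urlopen(PLAYER_URL % command)

if __name__ == '__main__':
    bridge = ButtonsBridge('buttons@foaf.tv', 'secret')
    if bridge.connect():
        bridge.process(block=True)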

The point of all this is about breaking open the Web-TV environment, so that different people and groups get to innovate without having to be colleagues or close-knit business partners. Control your Apple TV with your Google Android phone; or your Google TV with your Apple iPad, or your Boxee box with either. Write smart linking and bookmarking and annotation apps that improve TV for all viewers, rather than only those who’ve bought from the same company as you. I guess I managed to communicate something of this because people clapped generously when my iPhone app managed to pause Boxee. This post is about how we might get from evocative but toy demos to a useful and usable protocol, and about one of our largest obstacles: XMPP’s focus on server-mediated communications.

So what happened when I hit the ‘pause’ button on the iPhone remote app? Well, the app was already connected to the XMPP network, e.g. signed in as bob.notube@gmail.com via Google Talk’s servers. And so an XMPP stanza flowed out from the room we were in, across to Google somewhere, and then via XMPP server-to-server protocol over to my self-run XMPP server (an ejabberd hosted on Amazon EC2’s east USA zone somewhere). And from there, the message returned finally to Brussels, flowing through whichever Python library I was using into Boxee (signed in as buttons@foaf.tv), causing the video to pause. This generally happened quickly, often very quickly; but sometimes it could take more than a second. This can be very frustrating, and while there are workarounds (keep-alive messages, smart code that ignores sequences of buffered ‘Pause!’ messages, apps that download metadata and bring more UI to the second screen, …), the problem has a simple cause: it just doesn’t make sense for a ‘pause’ message to cross the Atlantic twice, and pass through two XMPP servers, on its way across the living room from remote control to TV.

But first – why are we even using XMPP at all, rather than say HTTP? Partly because XMPP lets us easily address devices on home networks that aren’t publicly exposed as running a Web server. Partly for the symmetry of the protocol, since iPads, touch tables, smart phones, TVs and media centres can all host and play media items on their own displays, and we may have several such devices in a home setting that need to be in touch with one another. There’s also a certain laziness; XMPP already defines lots of useful pieces, like buddylist rosters, pubsub notifications and group chats; it has an active and friendly community; and it comes with a healthy collection of tools and libraries. My own interests are around exploring and collectively annotating the huge archives of content that are slowly coming online, and an expectation that this could be a more shared experience, so I’m following an intuition that XMPP provides more useful ‘raw materials’ for social content exploration than raw HTTP. That said, many elements of remote control can be defined and implemented in either environment. But for today, I’m concentrating on the XMPP side.

So back at FOSDEM I raised a couple of concerns, as a long-term XMPP well-wisher but non-insider.

The first was that the technology presents itself as a daunting collection of extensions, each of which might or might not be supported in some toolkit. To this someone (likely Dave Cridland) responded with the reassuring observation that most of these could be implemented by third-party app developers simply by reading and writing XMPP stanzas, and that in fact pretty much the only ‘core’ piece of XMPP that wasn’t treated as core in most toolkits was the point-to-point XEP-0174 ‘serverless messaging‘ mode. Everything else, the rest of us mortals could hack in application code. For serverless messaging we are left waiting and hoping for the toolkit maintainers to wire things in, as it generally requires fairly intimate knowledge of the relevant XMPP library.

My second point was in fact related: that if XMPP tools offered better support for serverless operation, it would open up lots of interesting application options, and that we certainly need it for the TV remotes use case to be a credible use of XMPP. Beyond TV remotes, there are obvious applications in the area of open, decentralised social networking. The recent buzz around things like StatusNet, GNU Social, Diaspora*, WebID, OneSocialWeb, alongside the old stuff like FOAF, shows serious interest in letting users take more decentralised control of their online social behaviour. Whether the two parties are in the same room on the same LAN, or halfway around the world from each other, XMPP and its huge collection of field-tested, code-supported extensions is relevant, even when those parties prefer to communicate directly rather than via servers.

With XMPP, third-party app developers have a well-defined framework into which they can drop ad-hoc stanzas of information, whether it’s a vCard or details of recently played music. This seems too useful a system to reserve solely for communications that are mediated by a server. And indeed, XMPP in theory is not tied to servers; the XEP-0174 spec tells us both how to do local-network Bonjour-style discovery, and how to layer XMPP on top of any communication channel that allows XML stanzas to flow back and forth.
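
For example (purely illustrative: the ‘buttons’ namespace and element below are invented, not a published protocol), an ad-hoc payload is just extra XML carried inside an ordinary message stanza:

# Illustrative only: wrapping an ad-hoc payload inside an XMPP message stanza.
# The 'buttons' namespace and element are invented for this sketch.
from xml.etree import ElementTree as ET

msg = ET.Element('message', {'to': 'buttons@foaf.tv', 'from': 'bob.notube@gmail.com'})
button = ET.SubElement(msg, '{http://example.org/ns/buttons}button')
button.set('action', 'pause')
print(ET.tostring(msg))
# prints something like:
#   <message from="bob.notube@gmail.com" to="buttons@foaf.tv">
#     <ns0:button xmlns:ns0="http://example.org/ns/buttons" action="pause" />
#   </message>

Exactly the same framing applies when the stanza travels over a serverless XEP-0174 link rather than through a server.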

From the abstract,

This specification defines how to communicate over local or wide-area networks using the principles of zero-configuration networking for endpoint discovery and the syntax of XML streams and XMPP messaging for real-time communication. This method uses DNS-based Service Discovery and Multicast DNS to discover entities that support the protocol, including their IP addresses and preferred ports. Any two entities can then negotiate a serverless connection using XML streams in order to exchange XMPP message and IQ stanzas.

But somehow this remains a niche use of XMPP. Many of the toolkits have some support for it, perhaps as work-in-progress or a patch, but it remains somewhat ‘out there’ rather than core to the XMPP approach. I’d love to see this change in 2011. The 0174 spec combines a few themes; it talks a lot about discovery, motivated in part by trade-fair and conference type scenarios. When your Apple laptop finds people locally on some network to chat with by “Bonjour”, it’s doing more or less XEP-0174. For the TV remote scenario, I’m interested in having nodes from a normal XMPP network drop down and “re-discover” themselves in a hopefully lower-latency point-to-point mode (within some LAN, across the Internet, or between NAT-protected home LANs). There are lots of scenarios where having a server in the loop isn’t needed, or adds cost and risk (latency, single point of failure, privacy concerns).
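
That discovery step is just DNS-SD / multicast DNS over the _presence._tcp service type, so you can watch it happening with an off-the-shelf zeroconf library. A rough sketch, assuming the python-zeroconf package (whose API details vary a little between versions):

# Rough sketch: browse the local network for XEP-0174 ("Bonjour chat") peers.
# Assumes the python-zeroconf package; its API varies a little between versions.
import time
from zeroconf import Zeroconf, ServiceBrowser

class PresenceListener(object):
    def add_service(self, zc, service_type, name):
        info = zc.get_service_info(service_type, name)
        if info:
            print("found peer %s on port %s" % (name, info.port))

    def remove_service(self, zc, service_type, name):
        print("peer went away: %s" % name)

    def update_service(self, zc, service_type, name):
        pass

zc = Zeroconf()
browser = ServiceBrowser(zc, "_presence._tcp.local.", PresenceListener())
try:
    time.sleep(30)   # watch for peers for half a minute
finally:
    zc.close()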

XEP-0174 continues,

6. Initiating an XML Stream
In order to exchange serverless messages, the initiator and
recipient MUST first establish XML streams between themselves,
as is familiar from RFC 3920.
First, the initiator opens a TCP connection at the IP address
and port discovered via the DNS lookup for an entity and opens
an XML stream to the recipient, which SHOULD include 'to' and
'from' address. [...]
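
In other words, once discovery has handed you an IP address and port, a serverless session starts with nothing more exotic than a TCP connection and a stream header. A minimal sketch; the address, port and JIDs below are placeholders standing in for whatever the discovery step returned:

# Sketch only: open the TCP connection and send a serverless stream header
# as described in XEP-0174. Address, port and JIDs are placeholders.
import socket

STREAM_HEADER = (
    "<?xml version='1.0'?>"
    "<stream:stream xmlns='jabber:client' "
    "xmlns:stream='http://etherx.jabber.org/streams' "
    "from='remote@laptop.local' to='buttons@tv.local' version='1.0'>"
)

sock = socket.create_connection(('192.168.1.20', 5298))  # discovered via mDNS
sock.sendall(STREAM_HEADER.encode('utf-8'))
# ... the recipient answers with its own stream header, after which ordinary
# <message/> and <iq/> stanzas can flow in both directions over this socket.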

The spec is pretty precise here: point-to-point communication is over TCP. The Security Considerations section discusses some of the different constraints for XMPP in serverless mode, and states that …

To secure communications between serverless entities, it is RECOMMENDED to negotiate the use of TLS and SASL for the XML stream as described in RFC 3920

Having stumbled across Datagram TLS (wikipedia, design writeup), I wonder whether that might also be an option for the layer providing the XML stream between entities.  For example, the chownat tool shows a UDP-based trick for establishing bidirectional communication between entities, even when they’re both behind NAT. I can’t help but wonder whether XMPP could be layered somehow on top of that (OpenSSL libraries have Datagram TLS support already, apparently). There are also other mechanisms I’ve been discussing with Mo McRoberts and Libby Miller lately, e.g. Mo’s dynamic dns / pubkeys idea, or his trick of running an XMPP server in the home, and opening it up via UPnP. But that’s for another time.

So back on my main theme: XMPP is holding itself back by always emphasising the server-mediated role. XEP-0174 has the feel of an afterthought rather than a core part of what the XMPP community offers to the wider technology scene, and the support for it in toolkits lags similarly. I’d love to hear from ‘live and breathe XMPP’ folk what exactly they think is needed before it can become a more central part of the XMPP world.

From the TV remotes use case we have a few constraints, such as the need to associate identities established in different environments (eg. via public key). If xmpp:danbri-ipad@danbri.org is already on the server-based XMPP roster of xmpp:nevali-tv@nevali.net, can pubkey info in their XMPP vCards be used to help re-establish trusted communications when the devices find themselves connected in the same LAN? It seems just plain nuts to have a remote control communicate with another box in the same room via transatlantic links through Google Talk and Amazon EC2, and yet that’s the general pattern of normal XMPP communications. What would it take to have more out-of-the-box support for XEP-0174 from the XMPP toolkits? Some combination of beer, money, or a shared sense that this is worth doing and that XMPP has huge potential beyond the server-based communications model it grew from?

How to tell you’re living in the future: bacterial computers, HTML and RDF

Clue no.1. Papers like “Solving a Hamiltonian Path Problem with a bacterial computer” barely raise an eyebrow.

Clue no.2. Undergraduates did most of the work.

And the clincher, …

Clue no.3. The paper is shared nicely in the Web, using HTML, Creative Commons document license, and useful RDF can be found nearby.

From those-crazy-eggheads dept, … bacterial computers solving graph data problems. Can’t wait for the JavaScript API. The thing of interest here isn’t so much the mad science as what they say about how they did it (though the paper is pretty fun stuff too).

The successful design and construction of a system that enables bacterial computing also validates the experimental approach inherent in synthetic biology. We used new and existing modular parts from the Registry of Standard Biological Parts [17] and connected them using a standard assembly method [18]. We used the principle of abstraction to manage the complexity of our designs and to simplify our thinking about the parts, devices, and systems of our project. The HPP bacterial computer builds upon our previous work and upon the work of others in synthetic biology [19-21]. Perhaps the most impressive aspect of this work was that undergraduates conducted every aspect of the design, modeling, construction, testing, and data analysis.

Meanwhile, over on partsregistry.org you can read more about the bits and pieces they squished together. It’s like a biological CPAN. And in fact the analogy is being actively pursued: see openwetware.org’s work on an RDF description of the catalogue.

I grabbed an RDF file from that site and can confirm that simple queries like

select * from <SemanticSBOLv0.13_BioBrick_Data_v0.13.rdf>  where {<http://sbol.bhi.washington.edu/rdf/sbol.owl#BBa_I715022> ?p ?v }

and

select * from <SemanticSBOLv0.13_BioBrick_Data_v0.13.rdf>  where {?x ?p <http://sbol.bhi.washington.edu/rdf/sbol.owl#BBa_I715022>  }

… do navigate me around the graph describing the pieces discussed in their paper.
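
(If you’d rather poke at the data programmatically, the same exploration is a few lines of Python with rdflib; the filename below is just whatever you saved the download as.)

# Sketch: load the downloaded BioBrick RDF and walk outward from one part.
# Assumes rdflib; the local filename is whatever you saved the data as.
from rdflib import Graph, URIRef

g = Graph()
g.parse("SemanticSBOLv0.13_BioBrick_Data_v0.13.rdf")

part = URIRef("http://sbol.bhi.washington.edu/rdf/sbol.owl#BBa_I715022")
for p, o in g.predicate_objects(part):      # triples with the part as subject
    print("%s %s" % (p, o))
for s, p in g.subject_predicates(part):     # triples pointing at the part
    print("%s %s" % (s, p))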

Here’s what the HTML paper says right now,

We designed and built all the basic parts used in our experiments as BioBrick compatible parts and submitted them to the Registry of Standard Biological Parts [17]. Key basic parts and their Registry numbers are: 5′ RFP (BBa_I715022), 3′ RFP (BBa_ I715023), 5′ GFP (BBa_I715019), and 3′ GFP (BBa_I715020). All basic parts were DNA sequence verified. The basic parts hixC(BBa_J44000), Hin LVA (BBa_J31001) were used from our previous experiments [8]. The parts were assembled by the BioBrick standard assembly method [18] yielding intermediates and devices that were also submitted to the Registry. Important intermediate and devices constructed are: Edge A (BBa_S03755), Edge B (BBa_S03783), Edge C (BBa_S03784), ABC HPP construct (BBa_I715042), ACB HPP construct (BBa_I715043), and BAC HPP construct (BBa_I715044). We previously built the Hin-LVA expression cassette (BBa_S03536) [8].

How nice to have a scholarly publication in HTML format, published open access under a Creative Commons license, and backed by machine-processable RDF data. Never mind undergrads getting bacteria to solve NP-hard graph problems, it’s the modern publishing and collaboration machinery described here that makes me feel I’m living in the future…

(World Wide Web – Let’s Share What We Know…)

ps. thanks to Dan Connolly for nudging me to get this shared with the planetrdf.com-reading community. Maybe it’ll nudge Kendall into posting something too.

‘Republic of Letters’ in R / Custom Widgets for Second Screen TV navigation trails

As ever, I write one post that perhaps should’ve been two. This is about the use and linking of datasets that aid ‘second screen’ (smartphone, tablet) TV remotes, and it takes as a quick example a navigation widget and underlying dataset that show us how we might expect to navigate TV archives, in some future age when TV lives more fully in the World Wide Web. I argue that access to the ‘raw data‘ and frameworks for embedding visualisation apps are of equal importance when thinking about innovative ways of exploring the ever-growing archives. All of this comes from many discussions with my NoTube colleagues and other collaborators; rambling scribblyness is all my own.

Ben Hammersley points us at a lovely Flash visualization, “Mapping the Republic of Letters” (http://www.stanford.edu/group/toolingup/rplviz/).

From the YouTube overview, “Researchers map thousands of letters exchanged in the 18th century’s “Republic of Letters” and learn at a glance what it once took a lifetime of study to comprehend.”


Mapping the Republic of Letters has at its center a multidimensional data set which spans 300 years and nearly 100,000 letters. We use computing tools that help us to measure and analyze data quantitatively, though that will not take us to our goal. While we use software and computing techniques that were designed for scientific and statistical methods, we are seeking to develop computing tools to enhance humanistic methods, to help us to explore qualitative aspects of the Republic of Letters. The subject of our study and the nature of the material require it. The collections of correspondence and records of travel from this period are incomplete. Of that incomplete material only a fraction has been digitized and is available to us. Making connections and resolving ambiguities in the data is something that can only be done with the help of computing, but cannot be done by computing alone. (from ‘methods and philosophy‘)


[Screenshot: Republic of Letters app, showing social network links superimposed on a map of historical western Europe]


See their detailed writeup for more on this fascinating and quite beautiful work. As I’m working lately on linking TV content more deeply into the Web, and on ‘second screen’ navigation, this struck me as just the kind of interface which it ought to be possible to re-use on a tablet PC to explore TV archives. Forgetting for the moment difficulties with Flash on iPads and so on, the idea roughly is that it would be great to embed such a visualization within a TV watching environment, such that when the ‘republic of letters’ widget is focussed on some person, place, or topic, we should have the opportunity to scan the available TV archives for related materials to show.

So a glance at Chrome’s ‘developer tools’ panel gave me a link to the underlying data used by the visualisation. I don’t know exactly whose it is, nor how they want it used, so please treat it with respect. Still, there it is, sat in the Web, in tab-separated format, begging to be used. There’s a lot you can do with the Flash application that I’ve barely touched, but I’m intrigued by the underlying dataset. In particular, where they have the string “Tonson, Jacob”, the data linker in me wants to see a Wikipedia or DBpedia link, since they provide explanation, context, related people, places and themes; all precious assets when trying to scrape together related TV materials to inform, educate or entertain someone with. From a few test searches, it turns out that (many? most?) of the correspondents are quite easily matched to Wikipedia: William Congreve; Montagu, 1st earl of Halifax, Charles; Hough, bishop of Worcester, John; Stanyan, Abraham; … Voltaire and others. But what about the data?

Lately I’ve been learning just a little about R, a language used mainly for statistics and related analysis. Here’s what it’ll do ‘out of the box’, in untrained hands:

letters<-read.csv('data.txt',sep='\t', header=TRUE)
v_author = letters$Author=="Voltaire"
v_letters = letters[v_author, ]
Where were Voltaire’s letters sent?
> cbind(summary(v_letters$dest_country))
[,1]
Austria            2
Belgium            6
Canada             0
Denmark            0
England           26
France          1312
Germany           97
India              0
Ireland            0
Italy             68
Netherlands       22
Portugal           0
Russia             5
Scotland           0
Spain              1
Sweden             0
Switzerland      342
The Netherlands    1
Turkey             0
United States      0
Wales              0
As the overview and video on the ‘Republic of Letters’ site point out (“Tracking 18th-century “social network” through letters”), the patterns of correspondence, e.g. between Voltaire and England, Scotland and Ireland, jump out of the data (and even more so from its visualisation). There are countless ways this information could be explored, presented, sliced-and-diced. Only a custom app can really make the most of it, and the Republic of Letters work goes a long way in that direction. They also note that
The requirements of our project are very much in sync with current work being done in the linked-data/ semantic web community and in the data visualization community, which is why collaboration with computer science has been critical to our project from the start.
So the raw data in the Web here is a simple table; while we could spend time arguing about whether it would better be expressed in JSON, XML or an RDF notation, I’d rather see some discussion around what we can do with this information. In particular, I’m intrigued by the possibilities of R alongside the data-linking habits that come with RDF. If anyone manages to tease anything interesting from this dataset, perhaps mixed in with DBpedia, do post your results.
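As a tiny first step in that direction, here is the sort of thing I have in mind; a sketch only, using the SPARQLWrapper Python package against the public DBpedia endpoint, with deliberately naive label matching:

# Sketch: try to find a DBpedia resource for a "Surname, Forename" string from
# the letters dataset. Assumes the SPARQLWrapper package; matching solely on
# an English rdfs:label is deliberately naive.
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_candidates(name):
    # "Tonson, Jacob" -> "Jacob Tonson"
    parts = [p.strip() for p in name.split(",")]
    label = " ".join(reversed(parts)) if len(parts) == 2 else name

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?person WHERE {
          ?person a <http://dbpedia.org/ontology/Person> ;
                  rdfs:label "%s"@en .
        }""" % label)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["person"]["value"] for b in results["results"]["bindings"]]

print(dbpedia_candidates("Tonson, Jacob"))
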
And of course there are always other datasets to examine; for example see the Darwin correspondence archives, or the Open Knowledge Foundation’s Open Correspondence project which has a Dickens-based pilot. While it is wonderful having UI that is tuned to the particulars of some dataset, it is also great when we can re-use UI code to explore similarly structured data from elsewhere. On both the data side and the UI side, this is expensive, tough work to do well. My current concern is to maximise re-use of both UI and data for the particular circumstances of second-screen TV navigation, a scenario that is rarely a first priority for anyone!
My hope is that custom navigation widgets for this sort of data will be natural components of next-generation TV remote controls, and that TV archives (and other collections) will open up enough of their metadata to draw in (possibly paying) viewers. To achieve this, we need the raw data on both sides to be as connectable as possible, so that application authors can spend their time thinking about what their users really need and can use, rather than on whether they’ve got the ‘right’ Henry Newton.
If we get it right, there’s a central role for librarianship and archivists in curating the public, linked datasets that tell us about the people, places and topics that will allow us to make new navigation trails through Web-connected television, literature and encyclopedia content. And we’ll also see new roles for custom visualizations, once we figure out an embedding framework for TV widgets that lets them communicate with a display system, with other users in the same room or community, and that is designed for cross-referencing datasets that talk about the same entities, topics, places etc.
As I mentioned regarding Lonclass and UDC, collaboration around open shared data often takes place in a furtive atmosphere of guilt and uncertainty. Is it OK to point to the underlying data behind a fantastic visualisation? How can we make sure the hard work that goes into that data curation is acknowledged and rewarded, even while its results flow more freely around the Web, and end up in places (your TV remote!) that may never have been anticipated?

Lonclass and RDF

Lonclass is one of the BBC’s in-house classification systems – the “London classification”. I’ve had the privilege of investigating lonclass within the NoTube project. It’s not currently public, but much of what I say here is also applicable to the Universal Decimal Classification (UDC) system upon which it was based. UDC is also not fully public yet; I’ve made a case elsewhere that it should be, and I hope we’ll see that within my lifetime. UDC and Lonclass have a fascinating history and are rich cultural heritage artifacts in their own right, but I’m concerned here only with their role as the keys to many of our digital and real-world archives.

Why would we want to map Lonclass or UDC subject classification codes into RDF?

Lonclass codes can be thought of as compact but potentially complex sentences, built from the thousands of base ‘words’ in the Lonclass dictionary. By mapping the basic pieces, the words, to other data sources, we also enrich the compound sentences. We can’t map all of the sentences as there can be infinitely many of them – it would be an expensive and never-ending task.

For example, we might have a Lonclass code for “Report on the environmental impact of the decline of tin mining in Sweden in the 20th century“. This would be a jumble of numbers and punctuation which I won’t trouble you with here. But if we parse out that structure we can see the complex code as built from primitives such as ‘tin mining’ (itself e.g. ‘Tin’ and ‘Mining’), ‘Sweden’, etc. By linking those identifiable parts to shared Web data, we also learn more about the complex composite codes that use them. Wikipedia’s Sweden entry tells us in English, “Sweden has land borders with Norway to the west and Finland to the northeast, and water borders with Denmark, Germany, and Poland to the south, and Estonia, Latvia, Lithuania, and Russia to the east.”. Increasingly this additional information is available in machine-friendly form. Right now we can’t learn about Sweden’s borders from the bits of Wikipedia reflected into DBpedia’s Sweden entry, but the UN FAO’s geopolitical ontology does have this information and more in RDF form.

There is more, much more, to know about Sweden than can possibly be represented directly within Lonclass or UDC. Yet those facts may also be very useful for the retrieval of information tagged with Sweden-related Lonclass codes. If we map the Lonclass notion of ‘Sweden’ to identified concepts described elsewhere, then whenever we learn more about the latter, we also learn more about the former, and indirectly, about anything tagged with complex lonclass codes using that concept. Suddenly an archived TV documentary tagged as covering a ‘report on the environmental impact of the decline of tin mining in Sweden’ is accessible also to people or machines looking under Scandinavia + metal mining. Environmental matters, after all, often don’t respect geo-political borders; someone searching for coverage of environmental trends in a neighbouring country might well be happy to find this documentary. But should Lonclass or UDC maintain an index of which countries border which others? Surely not!

Lonclass and UDC codes have a rich hidden structure that is rarely exploited with modern tools. Lonclass, by virtue of its UDC heritage, does a lot of work itself towards representing complex conceptual inter-relationships. It embodies a conceptual map of our world, with mysterious codes (well known in the library world) for topics such as ‘622 – mining’, but also specifics e.g. ‘622.3 Mining of specific minerals, ores, rocks’, and combinations (‘622.3:553.9 Extraction of carbonaceous minerals, hydrocarbons’). By joining a code for ‘mining a specific mineral…’ to a code for ‘553.9 Deposits of carbonaceous rocks. Hydrocarbon deposits’ we get a compound term. So Lonclass/UDC “knows” about the relationship between “Tin Mining” and “Mining”, “metals” etc., and quite likely between “Sweden” and “Scandinavia”. But it can’t know everything! Sooner or later, we have to say, “Sorry, it’s not reasonable to expect the classification system to model the entire world; that’s a bigger problem”.
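
To make the ‘compound term’ idea concrete, here is a toy sketch of the kind of decomposition and mapping I mean; the code-to-URI table is invented for illustration, and real Lonclass/UDC notation has many more connectors and facets than the single ‘:’ handled here.

# Toy sketch: split a composed UDC-style code into its parts and look each part
# up in an (invented) table mapping simple codes to shared Web identifiers.
PART_MAP = {
    "622.3": "http://example.org/udc/622.3",   # Mining of specific minerals, ores, rocks
    "553.9": "http://example.org/udc/553.9",   # Deposits of carbonaceous rocks...
}

def decompose(code):
    # handle only the ':' relation connector; real UDC syntax is richer
    return [part for part in code.split(":") if part]

for part in decompose("622.3:553.9"):
    print("%s -> %s" % (part, PART_MAP.get(part, "(no shared identifier yet)")))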

Even within the closed, self-supporting universe of UDC/Lonclass, this compositional semantics system is a very powerful tool for describing obscure topics in terms of well known simpler concepts. But it’s too much for any single organisation (whether the BBC, the UDC Consortium, or anyone) to maintain and extend such a system to cover all of modern life; from social, legal and business developments to new scientific innovations. The work needs to be shared, and RDF is currently our best bet on how to create such work-sharing, meaning-sharing, information-linking systems in the Web. The hierarchies in UDC and Lonclass don’t attempt to represent all of objective reality; they instead show paths through information.

If the metaphor of a ‘conceptual map’ holds up, then it’s clear that at some point it’s useful to have our maps made by different parties, with different specialised knowledge. The Web now contains a small but growing Web of machine-readable descriptions. Over at MusicBrainz is a community who take care of describing the entities and relationships that cover much of music, or at least popular music. Others describe countries, species, genetics, languages, historical events, economics, and countless other topics. The data is sometimes messy or an imperfect fit for some task-in-hand, but it is actively growing, curated and connected.

I’m not arguing that Lonclass or UDC should be thrown out and replaced by some vague ‘linked cloud’. Rather, that there are some simple steps that can be taken towards making sure each of these linked datasets contributes to modernising our paths into the archives. We need to document and share open-source tools for an agreed data model for the arcane numeric codes of UDC and Lonclass. We need at least the raw pieces, the simplest codes, to be described for humans and machines in public, stable Web pages, and for their re-use, mapping, data mining and re-combination to be actively encouraged and celebrated. Currently, it is possible to get your hands on this data only if you work with the BBC (Lonclass), pay license fees (UDC) or exchange USB sticks with the right party in some shady backstreet. Whether the metaphor of choice is ‘key to the archives’ or ‘conceptual map of…’, this is a deeply unfortunate situation, both for the intrinsic public value of these datasets and for the collections they index. There’s a wealth of meaning hidden inside Lonclass and UDC and the collections they index, a lot that can be added by linking it to other RDF datasets, but more importantly there are huge communities out there who’ll do much of the work when the data is finally opened up…

I wrote too much. What I meant to say is simple. Classification systems with compositional semantics can be enriched when we map their basic terms using identifiers from other shared data sets. And those in the UDC/Lonclass tradition, while in some ways they’re showing their age (weird numeric codes, huge monolithic, hard-to-maintain databases), … are also amongst the most interesting systems we have today for navigating information, especially when combined with Linked Data techniques and companion datasets.

Disambiguating with DBpedia

Sketchy notes. Say you’re looking for an identifier for something, and you know it’s a company/organization, and you have a label “Woolworths”.

What can be done to choose amongst the results we find in DBpedia for this crude query?

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select distinct ?x where {
?x a <http://dbpedia.org/ontology/Organisation>;  rdfs:label ?l .
FILTER(REGEX(?l, "Woolworths*")).
}

More generally, are the tweaks and tricks needed to optimise this sort of disambiguation going to be cross-domain, or do we have to hand-craft them, case by case?
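
For what it’s worth, the first tweaks I’d reach for are generic rather than Woolworths-specific: anchor and case-fold the label match, and ask for English labels only. A sketch, assuming the SPARQLWrapper Python package and the public DBpedia endpoint:

# Sketch: the same organisation lookup with a few generic tweaks applied:
# anchored, case-insensitive label matching, restricted to English labels.
# Assumes the SPARQLWrapper package and the public DBpedia endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?x ?l WHERE {
  ?x a <http://dbpedia.org/ontology/Organisation> ;
     rdfs:label ?l .
  FILTER(LANG(?l) = "en")
  FILTER(REGEX(?l, "^Woolworths?", "i"))
}
"""

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print("%s  %s" % (row["x"]["value"], row["l"]["value"]))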

Sagan on libraries

“Books permit us to voyage through time, to tap the wisdom of our ancestors. The library connects us with the insights and knowledge, painfully extracted from Nature, of the greatest minds that ever were, with the best teachers, drawn from the entire planet and from all of our history, to instruct us without tiring, and to inspire us to make our own contribution to the collective knowledge of the human species. Public libraries depend on voluntary contributions. I think the health of our civilization, the depth of our awareness about the underpinnings of our culture and our concern for the future can all be tested by how well we support our libraries.” –Carl Sagan, http://en.wikiquote.org/wiki/Carl_Sagan

Easier in RDFa: multiple types and the influence of syntax on semantics

RDF is defined as an abstract data model, plus a collection of practical notations for exchanging RDF descriptions (eg. RDF/XML, RDFa, Turtle/N3). In theory, your data modelling activities are conducted in splendid isolation from the sleazy details of each syntax. RDF vocabularies define classes of thing, and various types of property/relationship that link those things. And then instance data uses arbitrary combinations of those vocabularies to make claims about stuff. Nothing in your vocabulary design says anything about XML or text formats or HTML or other syntactic details.

All that said, syntactic considerations can mess with your modelling. I’ve just written this up for the Linked Library Data group, but since the point isn’t often made, I thought I’d do so here too.

RDF instance data, ie. descriptions of stuff, is peculiar in that it lets you use multiple independent schemas at the same time. So I might use SKOS, FOAF, Bio, Dublin Core and DOAP all jumbled up together in one document. But there are some considerations when you want to mention that something is in multiple classes. While you can do this in any RDF notation, it is rather ugly in RDF/XML, historically RDF’s most official, standard notation. Furthermore, if you want to mention that two things are related by two or more specified properties, this can be super ugly in RDF/XML. Or at least rather verbose. These practical facts have tended to guide the descriptive idioms used in real world RDF data. RDFa changes the landscape significantly, so let me give some examples.

Backstory – decentralised extensibility

RDF classes from one vocabulary can be linked to more general or specific classes in another; we use rdfs:subClassOf for this. Similarly, RDF properties can be linked with rdfs:subPropertyOf claims. So for example in FOAF we might define a class foaf:Organization, and leave it at that. Meanwhile over in the Org vocabulary, they care enough to distinguish a subclass, org:FormalOrganization. This is great! Incremental, decentralised extensibility. Similarly, FOAF has foaf:knows as a basic link between people who know each other, but over in the relationship vocabulary, that has been specialized, and we see relationships like ‘livesWith‘, ‘collaboratesWith‘. These carry more specific meaning, but they also imply a foaf:knows link too.

This kind of machine-readable (RDFS/OWL) documentation of the patterns of meaning amongst properties (and classes) has many uses. It could be used to infer missing information: if Ian writes RDF saying “Alice collaboratesWith Bob” but doesn’t explicitly say that Alice also knows Bob, a schema-aware processor can add this in. Or it can be used at query time, if someone asks “who does Alice know?”. But using this information is not mandatory, and this creates a problem for publishers. Should they publish redundant information to make it easier for simple data consumers to understand the data without knowing about the more detailed (and often more recent) vocabulary used?
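
Here’s a toy version of that schema-aware step, sketched with rdflib; the people are invented, and the single expansion pass below is only a small fraction of what a real RDFS reasoner does:

# Toy sketch of schema-aware expansion: if the schema says rel:collaboratesWith
# is an rdfs:subPropertyOf foaf:knows, a triple using the specific property also
# implies the more general one. Assumes rdflib; Alice and Bob are invented.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDFS, FOAF

REL = Namespace("http://purl.org/vocab/relationship/")

schema = Graph()
schema.add((REL.collaboratesWith, RDFS.subPropertyOf, FOAF.knows))

data = Graph()
alice = URIRef("http://example.org/alice")
bob = URIRef("http://example.org/bob")
data.add((alice, REL.collaboratesWith, bob))

# one round of subPropertyOf expansion
for s, p, o in list(data):
    for general in schema.objects(p, RDFS.subPropertyOf):
        data.add((s, general, o))

print((alice, FOAF.knows, bob) in data)   # True: the implied triple is now there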

Historically, adding redundant triples to capture the more general claims has been rather expensive – both in terms of markup beauty, and also file size. RDFa changes this.

Here’s a simple RDF/XML description of something.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:foaf="http://xmlns.com/foaf/0.1/">
<foaf:Person rdf:about="#fred">
 <foaf:name>Fred Flintstone</foaf:name>
</foaf:Person>
</rdf:RDF>

…and here is how it would have to look if we wanted to add a 2nd type:

<foaf:Person rdf:about="#fred"
 rdf:type="http://example.com/vocab2#BiblioPerson">
  <foaf:name>Fred Flintstone</foaf:name>
</foaf:Person>
</rdf:RDF>

To add a 3rd or 4th type, we’d need to add in extra subelements eg.

<rdf:type rdf:resource="http://example.com/vocab2#BiblioPerson"/>

Note that the full URI for the vocabulary needs to be used at every occurrence of the type.  Here’s the same thing, with multiple types, in RDFa.

<html>
<head><title>a page about Fred</title></head>
<body>
<div xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:vocab2="http://example.com/vocab2#"
 about="#fred" typeof="foaf:Person vocab2:BiblioPerson" >
<span property="foaf:name">Fred Flintstone</span>
</div>
</body>
</html>

RDFa 1.0 requires the second vocabulary’s namespace to be declared, but after that it is pretty concise if you want to throw in a 2nd or a 3rd type, for whatever you’re describing. If you’re talking about a relationship between people, instead of "rel='foaf:knows'" you could put "rel='foaf:knows rel:livesWith'"; if you wanted to mention that something was in the class not just of organizations, but formal organizations, you could write "typeof='foaf:Organization org:FormalOrganization'".

Properties and classes serve quite different social roles in RDF. The classes tend towards being dull, boring, because they are the point of connection between different datasets and applications. The detail, personality and real information content in RDF lives in the properties. But both classes and properties fall into specialisation hierarchies that cross independent vocabularies. It is quite a common experience to feel stuck, not sure whether to use a widely known but vague term, or a more precise but ‘niche’, new or specialised vocabulary. As RDF syntaxes improve, this tension can melt away somewhat. In RDFa it is significantly easier to simply publish both, allowing smart clients to understand your full detail, and simple clients to find the patterns they expect without having to do schema-based processing.

Subject classification and Statistics

Subject classification and statistics share some common problems. This post takes a small example discussed at this week’s ODaF event on “Semantic Statistics” in Tilburg, and explores its expression coded in the Universal Decimal Classification (UDC). UDC supports faceted description, providing an abstract grammar allowing sentence-like subject descriptions to be composed from the “raw materials” defined in its vocabulary scheme.

This makes the mapping of UDC (and to some extent also Dewey classifications)  into W3C’s SKOS somewhat lossy, since patterns and conventions for documenting these complex, composed structures are not yet well established. In the NoTube project we are looking into this in a TV context, in large part because the BBC archives make extensive use of UDC via their Lonclass scheme; see my ‘investigating Lonclass‘ and UDC seminar talk for more on those scenarios. Until this week I hadn’t thought enough about the potential for using this to link deep into statistical datasets.

One of the examples discussed on Tuesday was as follows (via Richard Cyganiak):

“There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008″

There was much interesting discussion in Tilburg about the proper scope and role of Linked Data techniques for sharing this kind of statistical data. Do we use RDF essentially as metadata, to find ‘black boxes’ full of stats, or do we use RDF to try to capture something of what the statistics are telling us about the world? When do we use RDF as simple factual data directly about the world (eg. school X has N pupils [currently; or at time t]), and when does it become a carrier for raw numeric data whose meaning is not so directly expressed at the factual level?

The state of the art in applying RDF here seems to be SDMX-RDF, see Richard’s slides. The SDMX-RDF work uses SKOS to capture code lists, to describe cross-domain concepts and to indicate subject matter.

Given all this, I thought it would be worth taking this tiny example and looking at how it might look in UDC, both as an example of the ‘compositional semantics’ some of us hope to capture in extended SKOS descriptions, but also to explore scenarios that cross-link numeric data with the bibliographic materials that can be found via library classification techniques such as UDC. So I asked the ever-helpful Aida Slavic (editor in chief of the UDC), who talked me through how this example data item looks from a UDC perspective.

I asked,

So I’ve just got home from a meeting on semweb/stats. These folk encode data values with stuff like “There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008″. How much of that could have a UDC coding? I guess I should ask, how would you subject-index a book whose main topic was “occupational injuries in the Washington DC metro area in 2008″?

Aida’s reply (posted with permission):

You can present all of it & much more using UDC. When you encode a subject like this in UDC you store much more information than your proposed sentence actually contains. So my decision of how to ‘translate this into udc’ would depend on learning more about the actual text and the context of the message it conveys, implied audience/purpose, the field of expertise for which the information in the document may be relevant etc. I would probably wonder whether this is a research report, study, news article, textbook, radio broadcast?

Not knowing more then you said I can play with the following: 331.46(735.215.2/.4)"2008"

Accidents at work — Washington metropolitan area — year 2008
or a bit more detailed:  331.46-053.18(735.215.2/.4)"2008"
Accidents at work — dead persons – Washington metropolitan area — year 2008
[you can say the number of dead persons but this is not pertinent from point of view of indexing and retrieval]

…or maybe (depending what is in the content and what is the main message of the text) and because you used the expression ‘fatal injuries’ this may imply that this is more health and safety/ prevention area in health hygiene which is in medicine.

The UDC structures composed here are:

TIME “2008”

PLACE (735.215.2/.4)  Counties in the Washington metropolitan area

TOPIC 1
331     Labour. Employment. Work. Labour economics. Organization of  labour
331.4     Working environment. Workplace design. Occupational safety.  Hygiene at work. Accidents at work
331.46  Accidents at work ==> 614.8

TOPIC 2
614   Prophylaxis. Public health measures. Preventive treatment
614.8    Accidents. Risks. Hazards. Accident prevention. Personal protection. Safety
614.8.069    Fatal accidents

NB – classification provides a bit more context and is more precise than words when it comes to presenting content i.e. if the content is focused on health and safety regulation and occupation health then the choice of numbers and their order would be different e.g. 614.8.069:331.46-053.18 [relationship between] health & safety policies in prevention of fatal injuries and accidents at work.

So when you read  UDC number 331.46 you do not see only e.g. ‘accidents at work’ but  ==>  ‘accidents at work < occupational health/safety < labour economics, labour organization < economy
and when you see UDC number 614.8  it is not only fatal accidents but rather ==> ‘fatal accidents < accident prevention, safety, hazards < Public health and hygiene. Accident prevention

When you see (735.2….) you do not only see Washington but also United States, North America

So why is this interesting? A couple of reasons…

1. Each of these complex codes combines several different hierarchically organized components; just as they can be used to explore bibliographic materials, similar approaches might be of value for navigating the growing collections of public statistical data. If SKOS is to be extended / improved to better support subject classification structures, we should take care also to consider use cases from the world of statistics and numeric data sharing.

2. Multilingual aspects. There are plans to expose SKOS data for the upper levels of UDC. An HTML interface to this “UDC summary” is already available online, and includes collected translations of textual labels in many languages (see progress report). For example, we can look up 331.4 and find (in hierarchical context) definitions in English (“Working environment. Workplace design. Occupational safety. Hygiene at work. Accidents at work”), alongside e.g. Spanish (“Entorno del trabajo. Diseño del lugar de trabajo. Seguridad laboral. Higiene laboral. Accidentes de trabajo”), Croatian, Armenian, …
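
Once that SKOS data is out there, picking up the translations for a class is only a few lines of code; a sketch with rdflib, where the filename and concept URI are placeholders for whatever the UDC Summary actually publishes:

# Sketch: given SKOS data for a UDC class, list its prefLabels by language.
# Assumes rdflib; the filename and concept URI are placeholders, not the real
# identifiers used by the UDC Summary.
from rdflib import Graph, URIRef
from rdflib.namespace import SKOS

g = Graph()
g.parse("udc-summary.rdf")

concept = URIRef("http://example.org/udc/331.4")   # placeholder URI for UDC 331.4
for label in g.objects(concept, SKOS.prefLabel):
    print("%s: %s" % (label.language, label))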

Linked Data is about sharing work; if someone else has gone to the trouble of making such translations, it is probably worth exploring ways of re-using them. Numeric data is (in theory) linguistically neutral; this should make linking to translations particularly attractive. Much of the work around RDF and stats is about providing sufficient context to the raw values to help us understand what is really meant by “66” in some particular dataset. By exploiting SDMX-RDF’s use of SKOS, it should be possible to go further and to link out to the wider literature on workplace fatalities. This kind of topical linking should work in both directions: exploring out from numeric data to related research, debate and findings, but also coming in and finding relevant datasets that are cross-referenced from books, articles and working papers. W3C recently launched a Library Linked Data group; I look forward to learning more about how libraries are thinking about connecting numeric and non-numeric information.

RDFa in Drupal 7: last call for feedback before alpha release

Stéphane has just posted a call for feedback on the Drupal 7 RDFa design, before the first official alpha release.

First reaction above all, is that this is great news! Very happy to see this work maturing.

I’ve tried to quickly suggest some tweaks to the vocab, by hacking his diagram in photoshop. All it really shows is that I’ve forgotten how to use photoshop, but I’ll upload it here anyway.

So if you click through to the full image, you can see my rough edits.

I’d suggest:

  1. Use (dcterms) dc:subject as the way of pointing from a document to its SKOS subject.
  2. Use (dcterms) dc:creator as the relationship between a document and the person that created it (note that in FOAF, we now declare foaf:maker to map as an equivalentProperty to (dcterms)dc:creator).
  3. Distinguish between the description of the person versus their account in the Drupal system; I would use foaf:Person for the human, and sioc:User (a kind of foaf:OnlineAccount) as the drupal account. The foaf property to link from the former to the latter is foaf:account (new name for foaf:holdsAccount).
  4. Focus on SIOC where it is most at-home: in modelling the structure of the discussion; threading, comments and dialog.
  5. Provide a generated URI for the person. I don’t 100% understand Stephane’s comment, “Hash URIs for identifying things different from the page describing them can be implemented quite easily but this case hasn’t emerged in core” but perhaps this will be difficult? I’d suggest using URIs ending “userpage#!person” so the fragment IDs can’t clash with HTML usage.

If the core release can provide this basic structure, including a hook for describing the human person rather than the site-specific account (ie. sioc:User), then extensions should be able to add their own richness. The current markup doesn’t quite work for that end, as the human user is only described indirectly (unless I misunderstand the current reading of sioc:User).
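
To make that suggested structure concrete, here is roughly the shape I mean, built with rdflib (a sketch; the page and account URIs are invented stand-ins for whatever Drupal would actually generate):

# Sketch of the structure suggested above. The page and account URIs are
# invented stand-ins for whatever Drupal would actually generate.
from rdflib import Graph, Namespace, URIRef, RDF
from rdflib.namespace import DCTERMS, FOAF

SIOC = Namespace("http://rdfs.org/sioc/ns#")

g = Graph()
page    = URIRef("http://example.org/node/42")
account = URIRef("http://example.org/user/alice")
person  = URIRef("http://example.org/user/alice#!person")

g.add((page, DCTERMS.creator, person))     # document -> the human who created it
g.add((person, RDF.type, FOAF.Person))     # the human...
g.add((person, FOAF.account, account))     # ...linked to their site account
g.add((account, RDF.type, SIOC.User))      # the Drupal account itself

print(g.serialize(format="turtle"))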

Anyway, I’m nitpicking! This is really great, and a nice and well-deserved boost for the RDFa community.