Essays



How it works: The Web
Originally uploaded by danbri

Or, “but what do all those links mean?”

Based on the 1994 slides by TimBL which inspired the SWAD-Europe graphics and shirt.

The twist here is just an emphasis that the giant global graph is a graph of idiosyncratic claims, and only sometimes do we all see the world the same way.

“ordinary life is pretty complex stuff” - Harvey Pekar

I’ve been using the SPARQL query language to access a very ad-hoc collection of personal and social graph data, and thanks to Bengee’s ARC system this can sit inside my otherwise ordinary WordPress installation. At the moment, everything in there is public, but lately I’ve been discussing OAuth with a few folk as a way of mediating access to selected subsets of that data. Which means the data store will need some way of categorising the dozens of miscellaneous data source URIs. There are a few ways to do this; here I try a slightly non-obvious approach.

Every SPARQL store can have many graphs inside, named by URI, plus optionally a default graph. The way I manage my store is a kind of structured chaos, with files crawled from links in my own data and my friends’. One idea for indicating the structure of this chaos is to keep “table of contents” metadata in the default graph. For example, I might load up <http://danbri.org/foaf.rdf> into a SPARQL graph named with that URI. And I might load up <http://danbri.org/evilfoaf.rdf> into another graph, also using the retrieval URI to identify the data within my SPARQL store. Now, two points to make here. Firstly, the SPARQL spec does not mandate that we do things this way. An application that, for example, wanted to keep historical versions of a FOAF, RSS or schema document could keep the triples from each version in a different named graph. Perhaps these might be named with UUIDs. The second point is that there are many different “meta” claims we want to store about our datasets, and mixing them all into the store-wide “default graph” could be rather limiting, especially if we might not want to unconditionally believe even those claims.

In the example above, I have data from running a PGP check against foaf.rdf (which passed) and evilfoaf.rdf (which doesn’t pass a check against my PGP identity). Now where do I store the results of this PGP checking? Perhaps the default graph, but maybe that’s messy. The idea I’m playing with here is that UUIDs are reasonable identifiers for such graphs, and that perhaps we’ll find ourselves sharing common UUIDs across stores.

Go back to my sent-mail FOAF crawl example from yesterday. How far did I get? Well the end result was a list of URLs which I looped through, and loaded into my big chaotic SPARQL store. If I run the following query, I get a list of all the data graphs loaded:

SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o . } }

This reveals 54 URLs, basically everything I’ve loaded into ARC in the last month or so. Only 30 of these came from yesterday’s hack, which used Google’s new Social Graph API to allow me to map from hashed mailbox IDs to crawlable data URIs. So today’s game is to help me disentangle the 30 from the 54, and superimpose them on each other, but not always mixed with every other bit of information in the store. In other words, I’m looking for a flexible, query-based way of defining views into my personal data chaos.

So, what I tried. I took the result of yesterday’s hack, a file of data URIs called urls.txt. Then I modified my command-line data loader script (yeah yeah, this should be part of WordPress). My default data loader simply takes each URI, gets the data, and shoves it into the store under a graph name which is the URI used for retrieval. What I did today is, additionally, make a “table of contents” overview graph. To avoid worrying about names, I generated a UUID and used that. So there is a graph called <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> which contains simple assertions of the form:

<http://www.advogato.org/person/benadida/foaf.rdf> a <http://xmlns.com/foaf/0.1/Document> .
<http://www.w3c.es/Personal/Martin/foaf.rdf> a <http://xmlns.com/foaf/0.1/Document> .
<http://www.iandickinson.me.uk/rdf/foaf.rdf> a <http://xmlns.com/foaf/0.1/Document> . # etc

…for each of the 30 files my crawler loaded into the store.

This lets us use <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> as an indirection point for information related to this little mailbox crawler hack. I don’t have to “pollute” the single default graph with this data. And because the uuid: was previously meaningless, it is something we might decide makes sense to use across data visibility boundaries, ie. you might use the same UUID in your own SPARQL store, so we can share queries and app logic.
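A rough sketch of that table-of-contents step, in Python rather than whatever the real loader script is written in (its code isn't shown here). The function names are invented for illustration; only the uuid: naming scheme and the foaf:Document typing come from the description above.

```python
import uuid

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_DOCUMENT = "http://xmlns.com/foaf/0.1/Document"


def new_toc_graph_name():
    """Mint a fresh, previously meaningless uuid: URI to name a graph."""
    return "uuid:" + str(uuid.uuid1())


def toc_triples(urls):
    """Build one N-Triples line per crawled URL, typing it as a foaf:Document.

    Blank lines and surrounding whitespace (as found in a urls.txt-style
    file) are ignored; duplicates are dropped while keeping first-seen order.
    """
    seen = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.append(url)
    return ["<%s> <%s> <%s> ." % (u, RDF_TYPE, FOAF_DOCUMENT) for u in seen]
```

The resulting lines would then be loaded into the store under the name returned by new_toc_graph_name(); how that load happens depends on the store in use.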

Here’s a simple query. It says, “ask the mailbox crawler table of contents graph (which we call uuid:420d9etc…) for all things it knows about that are a Document”. Then it says “ask each of those documents, for everything in it”. And then the SELECT clause returns all the property URIs. This gives a first level overview of what’s in the Web of data files found by the crawl. Query was:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?p WHERE {
  GRAPH <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> { ?crawled a :Document . }
  GRAPH ?crawled { ?s ?p ?o . }
}
ORDER BY ?p

I’ll just show the first page full of properties it found; for the rest see link to the complete set. Since W3C’s official SPARQL doesn’t have aggregates, we’d need to write application code (or use something like the SPARQL+ extensions) to get property usage counts. Here are some of the properties that were found in the data:

http://kota.s12.xrea.com/vocab/uranaibloodtype
http://purl.org/dc/elements/1.1/creator
http://purl.org/dc/elements/1.1/description
http://purl.org/dc/elements/1.1/format
http://purl.org/dc/elements/1.1/title
http://purl.org/dc/terms/created
http://purl.org/dc/terms/modifed
http://purl.org/dc/terms/modified
http://purl.org/net/inkel/rdf/schemas/lang/1.1#masters
http://purl.org/net/inkel/rdf/schemas/lang/1.1#reads
http://purl.org/net/inkel/rdf/schemas/lang/1.1/masters
http://purl.org/net/inkel/rdf/schemas/lang/1.1/reads
http://purl.org/net/inkel/rdf/schemas/lang/1.1/speaks
http://purl.org/net/schemas/quaffing/drankBeerWith
http://purl.org/net/schemas/quaffing/drankLagerWith
http://purl.org/net/vocab/2004/07/visit#caregion
http://purl.org/net/vocab/2004/07/visit#country
http://purl.org/net/vocab/2004/07/visit#usstate
http://purl.org/ontology/mo/hasTrack
http://purl.org/ontology/mo/myspace
http://purl.org/ontology/mo/performed
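
For the usage counts mentioned above, the application-code route can be very small. This Python sketch assumes the query is re-run without DISTINCT, so that each matching triple yields one row, and that rows arrive as dictionaries keyed by variable name. Both are assumptions about the client code, not features of ARC.

```python
from collections import Counter


def property_usage(rows):
    """Count how many result rows mention each property URI.

    `rows` is an iterable of dicts such as {"p": "http://..."}: one per
    matched triple (a hypothetical shape for already-decoded results).
    """
    return Counter(row["p"] for row in rows)


def usage_report(rows):
    """Return (property, count) pairs, most used first, ties broken by URI."""
    counts = property_usage(rows)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Sorting ties by URI keeps the report stable from run to run, which makes it easier to compare crawls.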

So my little corner of the Web includes properties that extend FOAF documents to include blood types, countries that have been visited, language skills that people have, music information, and even drinking habits. But remember that this comes from my corner of the Web - people I’ve corresponded with - and probably isn’t indicative of the wider network. But that’s what grassroots decentralised data is all about. The folk who published this data didn’t need to ask permission of any committee to do so, they just mixed in what they wanted to say, alongside terms more widely used like foaf:Person, foaf:name. This is the way it should be: ask forgiveness, not permission, from the language lawyers and standardistas.

OK, so let’s dig deeper into the messy data I crawled up from my sentmail contacts.

Here’s one that finds some photos, either using FOAF’s :img or :depiction properties:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT * WHERE {
GRAPH <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> { ?crawled a :Document . }
GRAPH ?crawled {
{ ?x :depiction ?y1 } UNION { ?x :img ?y2 } .
}
}

Here’s another that asks the crawl results for names and homepages it found:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT * WHERE {
  GRAPH <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> { ?crawled a :Document . }
  GRAPH ?crawled { [ :name ?n; :homepage ?h ] }
}

To recap, the key point here is that social data in a SPARQL store will be rather chaotic. Information will often be missing, and often be extended. It will come from a variety of parties, some of whom you trust, some of whom you don’t know, and a few of whom will be actively malicious. Later down the line, subsets of the data will need different permissioning: if I export a family tree from ancestry.co.uk, I don’t want everyone to be able to do a SELECT for mother’s maiden name and my date of birth.

So what I suggest here, is that we can use UUID-named graphs as an organizing structure within an otherwise chaotic SPARQL environment. The demo here shows how one such graph can be used as a “table of contents” for other graphs associated with a particular app — in this case, the Google-mediated sentmail crawling app I made yesterday. Other named views might be: those data files from colleagues, those files that are plausibly PGP-signed, those that contain data structured according to some particular application need (eg. calendar, addressbook, photos, …).
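
One way to picture these named views: each table-of-contents graph boils down to a set of data graph URIs, so views can be combined with ordinary set operations. The Python sketch below works on that assumption; the view names and URLs in any usage are placeholders, not real graphs from the store.

```python
def combine_views(views, require=(), exclude=()):
    """Return graph URIs listed in every `require` view and in no `exclude` view.

    `views` maps a table-of-contents graph name to the set of data graph
    URIs it lists. With nothing required, start from the union of all views.
    """
    if require:
        result = set.intersection(*(set(views[name]) for name in require))
    else:
        result = set().union(*views.values())
    for name in exclude:
        result -= set(views[name])
    return result
```

For example, requiring both a "crawled" view and a "PGP check passed" view would yield only the crawled files whose signatures checked out, without either claim having to live in the default graph.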

I’m digesting some of the reactions to Google’s recently announced Social Graph API. ReadWriteWeb ask whether this is a creeping privacy violation, and danah boyd has a thoughtful post raising concerns about whether the privileged tech elite have any right to experiment in this way with the online lives of those who lack status and knowledge of these obscure technologies, and who may be amongst the more vulnerable users of the social Web.

While I tend to agree with Tim O’Reilly that privacy by obscurity is dead, I’m not of the “privacy is dead, get over it” school of thought. Tim argues,

The counter-argument is that all this data is available anyway, and that by making it more visible, we raise people’s awareness and ultimately their behavior. I’m in the latter camp. It’s a lot like the evolutionary value of pain. Search creates feedback loops that allow us to learn from and modify our behavior. A false sense of security helps bad actors more than tools that make information more visible.

There’s a danger here of technologists seeming to blame those we’re causing pain for. As danah says, “Think about whistle blowers, women or queer folk in repressive societies, journalists, etc.”. Not everyone knows their DTD from their TCP, or understands anything of how search engines, HTML or hyperlinks work. And many folk have more urgent things to focus on than learning such obscurities, let alone understanding the practical privacy, safety and reputation-related implications of their technology-mediated deeds.

Web technologists have responsibilities to the users of the Web, and while media education and literacy are important, those who are shaping and re-shaping the Web ought to be spending serious time on a daily basis struggling to come up with better ways of allowing humans to act and interact online without other parties snooping. The end of privacy by obscurity should not mean the death of privacy.

Privacy is not dead, and we will not get over it.

But it does need to be understood in the context of the public record. The reason I am enthusiastic about the Google work is that it shines a big bright light on the things people are currently putting into the public record. And it does so in a way that should allow people to build better online environments for those who do want their public actions visible, while providing immediate - and sometimes painful - feedback to those who have over-exposed themselves in the Web, and wish to backpedal.

I hope Google can put a user support mechanism on this. I know from our experience in the FOAF community, even with small scale and obscure aggregators, people will find themselves and demand to be “taken down”. While any particular aggregator can remove or hide such data, unless the data is tracked back to its source, it’ll crop up elsewhere in the Web.

I think the argument that FOAF and XFN are particularly special here is a big mistake. Web technologies used correctly (posh - “plain old semantic html” in microformats-speak) already facilitate such techniques. And Google is far from the only search engine in existence. Short of obfuscating all text inside images, personal data from these sites is readily harvestable.

ReadWriteWeb comment:

None the less, apparently the absence of XFN/FOAF data in your social network is no assurance that it won’t be pulled into the new Google API, either. The Google API page says “we currently index the public Web for XHTML Friends Network (XFN), Friend of a Friend (FOAF) markup and other publicly declared connections.” In other words, it’s not opt-in by even publishers - they aren’t required to make their information available in marked-up code.

The Web itself is built from marked-up code, and this is a thing of huge benefit to humanity. Both microformats and the Semantic Web community share the perspective that the Web’s core technologies (HTML, XHTML, XML, URIs) are properly consumed both by machines and by humans, and that any effort to create documents that are usable only by (certain fortunate) humans is anti-social and discriminatory.

The Web Accessibility movement have worked incredibly hard over many years to encourage Web designers to create well marked up pages, where the meaning of the content is as mechanically evident as possible. The more evident the meaning of a document, the easier it is to repurpose it or present it through alternate means. This goal of device-independent, well marked up Web content is one that unites the accessibility, Mobile Web, Web 2.0, microformat and Semantic Web efforts. Perhaps the most obvious case is for blind and partially sighted users, but good markup can also benefit those with the inability to use a mouse or keyboard. Beyond accessibility, many millions of Web users (many poor, and in poor countries) will have access to the Web only via mobile phones. My former employer W3C has just published a draft document, “Experiences Shared by People with Disabilities and by People Using Mobile Devices”. Last month in Bangalore, W3C held a Workshop on the Mobile Web in Developing Countries (see executive summary).

I read both Tim’s post, and danah’s post, and I agree with large parts of what they’re both saying. But not quite with either of them, so all I can think to do is spell out some of my perhaps previously unarticulated assumptions.

  • There is no huge difference in principle between “normal” HTML Web pages and XFN or FOAF. Textual markup is what the Web is built from.
  • FOAF and XFN take some of the guesswork out of interpreting markup. But other technologies (javascript, perl, XSLT/GRDDL) can also transform vague markup into more machine-friendly markup. FOAF/XFN simply make this process easier and less heuristic, less error prone.
  • Google was not the first search engine, it is not the only search engine, and it will not be the last search engine. To obsess on Google’s behaviour here is to mistake Google for the Web.
  • Deeds that are on the public record in the Web may come to light months or years later; Google’s opening up of the (already public, but fragmented) Usenet historical record is a good example here.
  • Arguing against good markup practice on the Web (accessible, device independent markup) is something that may hurt underprivileged users (with disabilities, or limited access via mobile, high bandwidth costs etc).
  • Good markup allows content to be automatically summarised and re-presented to suit a variety of means of interaction and navigation (eg. voice browsers, screen readers, small screens, non-mouse navigation etc).
  • Good markup also makes it possible for search engines, crawlers and aggregators to offer richer services.

The difference between Google crawling FOAF/XFN from LiveJournal, versus extracting similar information via custom scripts from MySpace, is interesting and important solely to geeks. Mainstream users have no idea of such distinctions. When LiveJournal originally launched their FOAF files in 2004, the rule they followed was a pretty sensible one: if the information was there in the HTML pages, they’d also expose it in FOAF.

We need to be careful of taking a ruthless “you can’t make an omelette without breaking eggs” line here. Whatever we do, people will suffer. If the Web is made inaccessible, with information hidden inside image files or otherwise obfuscated, we exclude a huge constituency of users. If we shine a light on the public record, as Google have done, we’ll embarrass, expose and even potentially risk harm to the people described by these interlinked documents. And if we stick our head in the sand and pretend that these folk aren’t exposed, I predict this will come back to bite us in the butt in a few months or years, since all that data is out there, being crawled, indexed and analysed by parties other than Google. Parties with less to lose, and more to gain.

So what to do? I think several activities need to happen in parallel:

  • Best practice codes for those who expose, and those who aggregate, social Web data
  • Improved media literacy education for those who are unwittingly exposing too much of themselves online
  • Technology development around decentralised, non-public record communication and community tools (eg. via Jabber/XMPP)

Any search engine at all, today, is capable of supporting the following bit of mischief:

Take as a starting point a collection of user profiles on a public site. Extract all the usernames. Find the ones that appear in the Web less than say 10,000 times, and on other sites. Assume these are unique userIDs and crawl the pages they appear in, do some heuristic name matching, … and you’ll have a pile of smushed identities, perhaps linking professional and dating sites, or drunken college photos to respectable-new-life. No FOAF needed.

The answer I think isn’t to beat up on the aggregators, it’s to improve the Web experience such that people can have real privacy when they need it, rather than the misleading illusion of privacy. This isn’t going to be easy, but I don’t see a credible alternative.

I’ve lately started writing up and prototyping around a use-case for the “Group” construct in FOAF and for medium-sized, partially private data aggregators like SparqlPress. I think we can do something interesting to deal with the social pressure and information load people are experiencing on sites like Flickr and Twitter.

Often people have rather large lists of friends, contacts or buddies - publicly visible lists - which play a variety of roles in their online life. Sometimes these roles are in tension. Flickr, for example, allow their users to mark some of their contacts as “friend” or “family” (or both). Real life isn’t so simple, however. And the fact that this classification is shared (in Flickr’s case with everyone) makes for a potentially awkward dynamic. Do you really want to tell everyone you know whether they are a full “friend” or a mere “contact”? Let alone keep this information up to date on every social site that hosts it. I’ve heard a good few folk complain about the stress of adding yet another entry to the list of people they follow on Twitter. Their lists are often already huge through some sense of social obligation to reciprocate. I think it’s worth exploring some alternative filtering and grouping mechanisms.

On the one hand, we have people “bookmarking” people they find interesting, or want to stay in touch with, or “get to know better”. On the other, we have the bookmarked party sometimes reciprocating those actions because they feel it polite; a situation complicated by crude categories like “friend” versus “contact”. What makes this a particularly troublesome combination is when user-facing features, such as “updates from your buddies” or “photos from your friends/contacts” are built on top of these buddylists.

Take my own case on Flickr. I’m probably not typical, but I’m a use case I care about. I have made no real consistent use of the “friend” versus “contact” distinction; to even attempt to do so would be massively time consuming, and also a bit silly. There are all kinds of great people on my Flickr contacts list. Some I know really well, some I barely know but admire their photography, etc. It seems currently my Flickr account has 7 “family”, 79 “friends” and 604 “contacts”.

Now to be clear, I’m not grumbling about Flickr. Flickr is a work of genius, no nitpicking. What I ask for is something that goes beyond what Flickr alone can do. I’d like a better way of seeing updates from friends and contacts. This is just a specific case of a general thing (eg. RSS/Atom feed management), but let’s stay specific for now.

Currently, my Flickr email notification settings are:

  • When people comment on your photos: Yes
  • When your friends and family upload new photos: Yes (daily digest)
  • When your other contacts upload new photos: No

What this means is that I chose to see notifications of photos from those 86 people whom I have flagged as “friend” or “family”, and chose not to be notified of new photos from the other 604 contacts, even though that list contains many people I know, and would like to know better. The usability question here is: how can we offer more subtlety in this mechanism, without overwhelming users with detail? My guess is that we approach this by collecting in a more personal environment some information that users might not want to state publicly in an online profile. So a desktop (or weblog, e.g. SparqlPress) aggregation of information from addressbooks, email send/receive patterns, weblog commenting behaviours, machine readable resumes, … real evidence of real connections between real people.

And so I’ve been mining around for sources of “foaf:Group” data. As a work in progress testbed I have a list of OpenIDs of people whose comments I’ve approved in my blog. And another, of people who have worked on the FOAF wiki. I’ve been looking at the machine readable data from W3C’s Tech Reports digital archive too, as that provides information about a network of collaboration going back to the early days of the Web (and it’s available in RDF). I also want to integrate my sent-mail logs, to generate another group, and extract more still from my addressbook. On top of this of course I have access to the usual pile of FOAF and XFN, scraped or API-extracted information from social network sites and IM accounts. Rather than publish it all for the world to see, the idea here is to use it to generate simple personal-use user interfaces couched in terms of these groups. So, hopefully I shall be able to use those groups as a filter against the 600+ members of my flickr buddylist, and make some kind of basic RSS-filtered view of their photos.
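
The filtering step could be as simple as the following Python sketch. It assumes each group has already been reduced to a set of identifiers (OpenIDs, mailbox hashes, whatever the source offers) and that each Flickr contact carries a list of known identifiers. Those data shapes, and all the names used, are hypothetical.

```python
def groups_for(contact_ids, groups):
    """Return the sorted names of the groups sharing an identifier with a contact."""
    ids = set(contact_ids)
    return sorted(name for name, members in groups.items() if ids & set(members))


def filter_contacts(contacts, groups, wanted):
    """Keep contacts that show up in at least one of the `wanted` groups.

    `contacts` maps a contact name to its known identifiers; `groups` maps a
    group name to its member identifiers.
    """
    wanted = set(wanted)
    return sorted(
        name for name, ids in contacts.items()
        if wanted & set(groups_for(ids, groups))
    )
```

The group evidence stays private to whoever runs the filter; only the resulting view of photos needs to be shown, and nobody has to be publicly labelled.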

If this succeeds, it may help people to have huge buddylists without feeling they’ve betrayed their “real friends” by doing so, or forcing them to put their friends into crudely labelled public buckets marked “real friend” and “mere contact”. The core FOAF design principle here is our longstanding bias towards evidence-based rather than proclaimed information. Don’t just ask people to tell you who their friends are, and how close the relationship is (especially in public, and most especially in a format that can be imported into spreadsheets for analysis). Instead… take the approach of “don’t say it, show it”. Create tools based on the evidence that friendship and collaboration leave in the Web. The public Web, but also the bits that don’t show up in Google, and hopefully never will. If we can find a way to combine all that behind a simple UI, it might go a little way towards “waving not drowning” in information.

Much of this is stating the obvious, but I thought I’d leave the nerdly details of SPARQL and FOAF/RDF for another post.

Instead here’s a pointer to a Scoble article on Mr Facebook, where they’re grappling with similar issues but from a massive aggregation rather than decentralised perspective:

Facebook has a limitation on the number of friends a person can have, which is 4,999 friends. He says that was due to scaling/technical issues and they are working on getting rid of that limitation. Partly to make it possible to have celebrities on Facebook but also to make it possible to have more grouping kinds of features. He told me many members were getting thousands of friends and that those users are also asking for many more features to put their friends into different groups. He expects to see improvements in that area this year.

ps. anyone know if Facebook group membership data is accessible through their APIs? I know I can get group data from LiveJournal and from My Opera in FOAF, but would be cool to mix in Facebook too.

This just arrived in my mailbox, sneaking past my spam filters. While it advertises the most horrible spamsite, the words are strangely hypnotic…

Hei,
Inncrease your S.[E].X.U.AL health!

Sane, recalled me from these fantastic speculations. Miss

believer, please tell me in your own words do without them.

footnote: ‘a description of nova friar, sent an arrow after

the flying sheriff, of uranium deposits in tanganyika, of

the body.

A fair few people have been asking about FOAF exporters from Facebook. I’m not entirely sure what else is out there, but Matthew Rowe has just announced a Facebook FOAF generator. It doesn’t dump all 35 million records into your Web browser, thankfully. But it will export a minimal description of you and your Facebook associates. At the moment, you get name, a photo URL, and (in this revision of the tool) a Facebook account name using FOAF’s OnlineAccount construct.

As an aside, this part of the FOAF design provides a way for identifiers from arbitrary services to be described in FOAF without special-purpose support. Some services have shortcut property names, eg. msnChatID and we may add more, but it is also important to allow this kind of freeform, decentralised identification. People shouldn’t have to petition the FOAF spec editors before any given Social Network site’s IDs can be supported; they can always use their own vocabulary alongside FOAF, or use the OnlineAccount construct as shown here.

I’ve saved my Facebook export on my Web site, working on the assumption that Facebook IDs are not private data. If people think otherwise, let me know and I’ll change the setup. We might also discuss whether even sharing the names and connectivity graph will upset people’s privacy expectations, but that’s for another day. Let me know if you’re annoyed!

Here is a quick SPARQL query, which simply asks for details of each person mentioned in the file who has an account on Facebook.


PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name ?pic ?id
WHERE {
  [ a :Person;
    :name ?name;
    :depiction ?pic;
    :holdsAccount [ :accountServiceHomepage <http://www.facebook.com/> ; :accountName ?id ]
  ]
}
ORDER BY ?name

I tested this online using Dave Beckett’s Rasqal-based Web service. It should return a big list of the first 200 people matched by the query, ordered alphabetically by name.

For “Web 2.0” fans, SPARQL’s result sets are essentially tabular (just like SQL), and have encodings in both simple XML and JSON. So whatever you might have heard about RDF’s syntactic complexity, you can forget it when dealing with a SPARQL engine.
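
As an illustration of how little work the JSON encoding needs, here is a sketch in Python using only the standard library. The document layout it expects (head.vars and results.bindings) is the standard SPARQL JSON results format; the function name is made up for this example.

```python
import json


def rows_from_sparql_json(text):
    """Flatten a SPARQL JSON result document into a list of plain dicts.

    Each binding like {"name": {"type": "literal", "value": "..."}} becomes
    {"name": "..."}; variables left unbound in a row map to None.
    """
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        rows.append({
            var: binding[var]["value"] if var in binding else None
            for var in variables
        })
    return rows
```

This throws away the type information (literal versus URI), which is fine for a quick table but would matter to an application that needs to tell a link from a string.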

Here’s a fragment of the JSON results from the above query:


{
"name" : { "type": "literal", "value": "Dan Brickley" },
"pic" : { "type": "uri", "value": "http://danbri.org/yasns/facebook/danbri-fb.rdf" },
"id" : { "type": "literal", "value": "624168" }
},
{
"name" : { "type": "literal", "value": "Dan Brickley" },
"pic" : { "type": "uri", "value": "http://profile.ak.facebook.com/profile5/575/66/s501730978_7421.jpg" },
"id" : { "type": "literal", "value": "501730978" }
}, …

What’s going on here? (a) Why are there two of me? (b) And why does it think that one of us has my Facebook FOAF file’s URL as a mugshot picture?

There’s no big mystery here. Firstly, there’s another guy who has the cheek to be called Dan Brickley. We’re friends on Facebook, even though we should probably be mortal enemies or something. Secondly, why does it give him the wrong URL for his photo? This is also straightforward, if a little technical. Basically, it’s an easily-fixed bug in the version of the FOAF exporter I used. When an image URL is not available, the converter still generates markup like “<foaf:depiction rdf:resource=""/>”. This empty URL is treated in RDF as the extreme case of a relative link, ie. the same kind of thing as writing “../../images/me.jpg” in a normal Web page. And since RDF is all about de-contextualising information, your RDF parser will try to resolve the relative link before passing the data on to storage or query systems (fiddly details are available to those that care). An empty reference resolves to the document’s own URL, which is why the FOAF file itself turns up as a mugshot. If the foaf:depiction property were simply omitted when no photo was present, this problem wouldn’t arise. We’d then have to make the query a little more flexible, so that it still matched people even if there was no depiction, but that’s easy. I’ll show it next time.
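
The relative-link behaviour is easy to reproduce outside of RDF. This sketch uses Python's urllib.parse.urljoin as a stand-in for what an RDF parser does when resolving references against a document's base URI (a given parser's own resolver may differ in details). The is_missing_depiction helper is a hypothetical client-side workaround, not part of the exporter.

```python
from urllib.parse import urljoin

# The URL the Facebook FOAF export was saved at, as seen in the results above.
DOC_URL = "http://danbri.org/yasns/facebook/danbri-fb.rdf"


def resolve(reference, base=DOC_URL):
    """Resolve a possibly relative reference against the document's URL."""
    return urljoin(base, reference)


def is_missing_depiction(pic_url, base=DOC_URL):
    """Treat a depiction that points back at the document itself as absent."""
    return pic_url == base
```

Calling resolve("") gives back the document URL unchanged, which is the same effect seen in the query results. A client could use a check like is_missing_depiction to skip such values until the exporter is fixed.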

I mentioned a couple of days ago that SPARQL is a query language with built-in support for asking questions about data provenance, ie. we can mix in “according to Facebook”, “according to Jabber” right into the WHERE clause of queries such as the one I show here. I’m not going to get into that today, but I will close with a visual observation about why that is important.

[Image: YASN map, borrowed from data junk, valleywag blog]
To state the obvious, there’ll always be multiple Web sites where people hang out and socialise. A friend sent me this link the other day; a world map of social networks (thumbnail version copied here). I can’t vouch for the science behind it, but it makes the point that we risk fragmenting Web communities on geographic boundaries if we don’t bridge the various IM and YASN networks. There are lots of ways this can be done, each with different implications for user experience, business model, cost and practicality. But it has to happen. And when it does, we’ll be wanting ways of asking questions against aggregations from across these sites…

OK so I just stumbled upon this…

Bomb in Baghdad

…via Jonathan Chetwynd’s ever-inventive and SVG-happy Peepo.com.

The “Car Bomb in Baghdad” story is from a site created by Widgit Software, and explains itself as follows:

Symbolworld has been set up to provide a web site with material suitable for symbol readers of all ages. The internet is an important medium which many people really like to use. Sadly there is very little material that is appropriate or accessible by people with learning difficulties.

The copyright statement for Symbolworld says “symbols on this page are copyright of the commercial owners. They may not be copied or used in any other format without the written permission from the owner.”, which initially struck me as a potential challenge to any use of this particular symbol-set for online communication. But I don’t really know this scene, and I guess this copyright could be just the same as the way eg. fonts are copyrighted.

So I don’t know much about this particular project/company/product, but it reminded me of some similar work I heard about a few years ago. Back when the EU, in their occasionally infinite wisdom, funded SWAD-Europe to run around talking to interesting people about standards and the Semantic Web (and giving them t-shirts), Chaals organised a great developer workshop in Madrid on image annotation. We had the usual fun with SVG and RDF (which btw I’m still betting on) and I got to learn a bit about CCF. Seeing the Baghdad example this morning reminded me of all this. I’ve been clicking around and trying to gather my sprawling thoughts.

CCF, the Concept Coding Framework, is kinda image annotation in reverse. Instead of focussing on the description of the content, concepts etc associated with images, the emphasis is on the use of images to illustrate some enumerated set of concepts. From Chaals’ workshop report re outcomes, I’m reminded that we discussed…

How to use Creative Commons and similar vocabularies to determine whether a particular symbol can be freely used (typically in commercial systems the symbols themselves are proprietary, which can be a major barrier to communication between people who have different systems).

CCF was using some variant of SKOS (another SWAD-Europe activity). This found its way into SKOS itself, where we now have prefSymbol and altSymbol relationships that associate a skos:Concept with a dcmitype:Image. Borrowing an illustration from the SKOS Core guide here:

skos symbol diagram

The guide also notes a distinction between symbolic labelling and “depiction” in the FOAF sense; some symbols are purely symbolic, and have no representational content.

So, catching up with this area of work a little, I find the Bliss Symbolics, the WWAAC project final report, and various other accessibility-related efforts. But I’ve not really figured out where I’d start if I wanted to build something simple using a freely available symbol-set, nor what the main options/projects currently are. But there’s plenty of reading, including pages from a recent Bliss “think tank” meeting.

The latest I can find on CCF is that it has moved sites and that there are some Web interfaces available, but “sorry - there are no downloads for this project yet.” Ah, apparently the SYMBERED project is continuing development of CCF (aside: Bliss with Swedish translation; doubly incomprehensible to me!). There is a nice example on their site showing the multilingual aspect to this work, as well as contrasting Bliss with a more representationally-oriented symbol set; see their site for details.

Here’s a simple example just in English:

I want coffee and milk and cookies.

In case anyone thinks this whole exploration is a bit niche and obscure, take a look at how people use MSN and other IM systems sometime. And of course the Jabber/XMPP guys have been exploring specs for standard emoticons. Chaals also points out some connections to VoiceXML where “there are a handful of options available in an interaction designed to be through voice, and developers will define assorted ways of recognising from a user’s speech which of the relevant concepts is being matched.”

We’re also only a few clicks away from the survey conducted by the dreamily named W3C Emotion incubator group into use cases and technologies for the description of emotion using markup languages.

If an RDF-ized Wordnet is also thrown into the mix, assigning URIs to Synsets, WordSenses, Words, I think we might actually be getting somewhere. The version of Wordnet in RDF published at W3C doesn’t currently use SKOS, although this was discussed, and of course there are other representations that make more use of RDF class hierarchies for nouns (at the expense of linguistic lossyness and losing non-noun content). Princeton’s original English-language Wordnet has spawned many related projects and translations, but as far as I know, there is little integration amongst them, and not all of the data is public or freely re-usable.

I once had a silly dream of taking a photo to go with each and every noun term in Wordnet. A kind of SemWeb I-Spy, a cousin to Immuexa’s FOAF bingo. Or better, of doing that with friends. The rise of Flickr and tagging means that we’d probably do this now by aggregation, using Flickr and similar sites. But it seems conceivable to me that such an “illustrated wordnet” could be made, using either photo-oriented or symbol-oriented illustrations.

OK what might that buy us? Let’s try these two samples.

Not perfect, … but imho Wordnet gives a nice set of common and identifiable concepts that can be used as a hub for all kinds of different projects. And all we’d need is a huge pile of shared data shaped like this:

wordnet:word-cookie skos:prefSymbol wikicommons:Explosion.svg .

OK it’s not going to bring about world peace and the return of Esperanto, and of course there’s much more to language and communication than nouns and verbs (the easiest part of Wordnet to turn into visual symbols), but it does strike me as a fun little (big) project…. Wordnet is too huge to be useful in every context where we’d want a modest-sized symbol set (eg. IM emoticons), … but it is nicely searchable, and would provide a framework for such subsets to evolve and be interconnected.

Facebook in many ways is pretty open for a ’social networking’ site. It gives extension apps a good amount of access to both data and UI. But the closed world language employed in their UI betrays the immodest assumption “Facebook knows all”.

  • Eric Childress and Stuart Weibel are now friends with Charles McCathieNevile.
  • John Doe is now in a relationship.
  • You have 210 friends.

To state the obvious: maybe Eric, Stu and Chaals were already friends. Maybe Facebook was the last to know about John’s relationship; maybe friendship isn’t countable. As the walls between social networking sites slowly melt (I put Jabber/XMPP first here, with OpenID, FOAF, SPARQL and XFN as helper apps), me and my 210 closest friends will share fragments of our lives with a wide variety of sites. If we choose to make those descriptions linkable, the linked sites will increasingly need to refine their UI text to be a little more modest: even the biggest site doesn’t get the full story.

Closed World Assumption (Abort/Retry/Fail)
Facebook are far from alone in this (see this Xbox screenshot too, “You do not have any friends!”); but even with 35M users, the mistake is jarring, and not just to Semantic Web geeks of the missing isn’t broken school. It’s simply a mistake to fail to distinguish the world from its description, or the territory from the map.

A description of me and my friends hosted by a big Web site isn’t “my social network”. Those sites are just a database containing claims made by different people, some verified, some not. And with, inevitably, lots missing. My “social network” is an abstractification of a set of interlinked real-world histories. You could make the case that there has only ever been one “social network” since the distant beginnings of human society; certainly those who try to do genealogy with Web data formats run into this in a weaker form, including the need to balance competing and partial information. We can do better than categorised “buddylists” when describing people, their inter-connections and relationships. And in many ways Facebook is doing just great here. Aside from the Pirates-vs-Ninjas noise, many extension applications on Facebook allow arbitrary events from elsewhere in the Web to bubble up through their service and be seen (or filtered) by others who are linked to me in their database. For example:

Facebook is good at reporting events, generally. Especially those sourced outside the system. Where it isn’t so great is when reporting internal-events, eg. someone telling it about a relationship. Event descriptions are nice things to syndicate btw since they never go out of date. Syndicating descriptions of the changeable properties of the world, on the other hand, is more slippery since you need to have all other relevant facts to be able to say how the world is right now (or implicitly, how it used to be, before). “Dan has painted his car red” versus “Dan’s car is now red”. “Dan has bookmarked the Jabber user profile spec” versus “Dan now has 1621 bookmarks”. “Dan has added Charles to his Facebook profile” versus “Dan is now friends with Charles”.

We need better UI that reflects what’s really going on. There will be users who choose to live much of their lives in public view, spread across sites, sharing enough information for these accounts to be linked. Hopefully they’ll be as privacy-smart and selective as Pew suggests. Personas and ‘characters’ can be spread across sites without either site necessarily revealing a real-world identity; secrets are keepable, at least in theory. But we will see people’s behaviour and claims from one site leak into another, and with approval. I don’t think this will be just through some giant “social graph” of strictly enumerated relationships, but through a haze of vaguer data.

What we’re most missing is a style of end-user UI here that educates users about this world that spans websites, couching things in terms of claims hosted in sites, rather than in absolutist terms. I suppose I probably don’t have 210 “friends” (whatever that means) in real life, although I know a lot of great people and am happy to be linked to them online. But I have 210 entries in a Facebook-hosted database. My email whitelist file has 8785 email addresses in it currently; email accounts that I’m prepared to assume aren’t sending me spam. I’m sure I can’t have 8785 friends. My Google Mail (and hence GTalk Jabber) account claims 682 contacts, and has some mysterious relationship to my Orkut account where I have 200+ (more randomly selected) friends. And now the OpenID roster on my blog gives another list (as of today, 19 OpenIDs that made it past the Wordpress spam filter). Modern social websites shouldn’t try to tell me how many friends I have; that’s just silly. And they shouldn’t assume their database knows it all. What they can do is try to tell me things that are interesting to me, with some emphasis on things that touch my immediate world and the extended world of those I’m variously connected to.

So what am I getting at here? I guess it’s just that we need these big social sites to move away from making teen-talk claims about how the world is - “Sally (now) loves John” - and instead become reflectors for the things people are saying, “Sally announces that she’s in love with John”; “John says that he used to work for Microsoft” versus “John worked for Microsoft 2004-2006”; “Stanford University says Sally was awarded a PhD in 2008”. Today’s young internet users are growing up fast, and the Web around them needs also to mature.

One of the most puzzling criticisms you’ll often hear about the Semantic Web initiative is that it requires a single universal truth, a monolithic ontology to model all of human knowledge. Those of us in the SW community know that this isn’t so; we’ve been saying for a long time that our (meta)data architecture is designed to allow people to publish claims “in which statements can draw upon multiple vocabularies that are managed in a decentralised fashion by various communities of expertise.”
As the SemWeb technology stack now has a much better approach to representing data provenance (SPARQL named graphs replacing RDF’99 statement reification) I believe we should now be putting more emphasis on a related theme: Semantic Web data can represent disputes, competing claims, and contradictions. And we can query it in an SQL-like language (SPARQL) that allows us to ask questions not just of some all-knowing database, but about what different databases are telling us.
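To make the “who said what” idea concrete, here is a minimal sketch in plain Python rather than SPARQL. It treats a store as a list of quads (graph, subject, predicate, object), which is roughly what a SPARQL query using GRAPH ?g { ... } lets you ask about. The graph URIs, the ex: names and the sample data are all invented for illustration; this is not how ARC or any particular store works internally.

```python
from collections import defaultdict

# Each quad is (graph, subject, predicate, object). The graph URIs and the
# ex: property names are made up for this illustration.
QUADS = [
    ("http://stanford.example/records", "ex:sally", "ex:awardedPhD", "2008"),
    ("http://sally.example/foaf.rdf", "ex:sally", "ex:loves", "ex:john"),
    ("http://john.example/foaf.rdf", "ex:john", "ex:workedFor", "ex:microsoft"),
    ("http://gossip.example/feed", "ex:john", "ex:workedFor", "ex:apple"),
]


def who_says(quads, subject, predicate):
    """Group the claimed values for (subject, predicate) by the graph making the claim."""
    claims = defaultdict(set)
    for graph, s, p, o in quads:
        if s == subject and p == predicate:
            claims[graph].add(o)
    return dict(claims)


def has_competing_claims(quads, subject, predicate):
    """True when the store holds more than one value for the same property.

    Whether that is a real dispute or just two compatible facts is left to
    the application; the store only records who said what.
    """
    values = {o for _, s, p, o in quads if s == subject and p == predicate}
    return len(values) > 1
```

Asking who_says(QUADS, "ex:john", "ex:workedFor") gives back one answer per source instead of a single “truth”, which is the point: the UI can then say “John says…” and “this feed says…” rather than picking a winner silently.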

The closed world approach to data gives us a lot, don’t get me wrong. I’m not the only one with a love-hate relationship with SQL. There are many optimisations we can do in a traditional SQL or XML Schema environment which become hard in an RDF context. In particular, going “open world” makes for a harder job when hosting and managing data rather than merely aggregating and integrating it. Nevertheless, if you’re looking for a modern Web data environment for aggregating claims of the “Stanford University says Sally was awarded a PhD in 1995” form, SPARQL has a lot to offer.

When we’re querying a single, all-knowing, all-trusted database, SQL will do the job (see Facebook’s FQL, for example). When we need to take a bit more care with “who said what” and “according to whom?” aspects, coupled with schema extensibility and frequently missing data, SQL starts to hurt. If we’re aggregating (and building UI for) ’social web’ claims about the world rather than simple buddylists (which XMPP/Jabber gives us out of the box), I suspect aggregators will get burned unless they take care to keep careful track of who said what, whether using SPARQL or some home-grown database system in the same spirit. And I think they’ll find that doing so will be peculiarly rewarding, giving us a foundation for applications that do substantially more than merely listing your buddies…

So I meant to write about a 1-line piece of Javascript, but ended up with a 5000 word freeform essay on the nature of RDF, XML, validation and so forth. It could probably do with some editing, but for now the words are in pretty much the order they came out of my brain. A short summary: thinking about our expectations of RDF “validation” can teach us a lot about RDF’s value, about its relationship to XML, and about the things we should focus on building next.

I’ve just made a Javascript “favelet” for checking documents against the RDF/XML syntax. It uses the W3C RDF Validator, which in turn uses the ARP RDF parser from Jena.

The Javascript: CheckRDFSyntax

I have made some assumptions about the options passed to the validator; these are easily adjusted by looking at the source of the normal submission page. It currently asks for ‘N-Triples’ syntax back (this is more compact than the default tabular view), and also has the ‘rdf:RDF is omitted’ flag set, which is useful for checking documents that indicate their RDF-ness in some other way (eg. DOAP files with a content-type of ‘application/rdf+xml’).

The favelet (aka ‘bookmarklet’) exploits the ability to send ‘GET’ requests to the RDF validator. The current main form uses ‘POST’; it’s possible that GETs might be disallowed in the future, eg. for server-load issues. I have set the Javascript to not ask for images; it is probably best to leave things that way, since to do otherwise could overload this (free and useful) service.
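The trick is nothing more than building a GET URL with the document’s address and a few options in the query string. Here is a rough sketch of that step in Python rather than Javascript. The endpoint and all three parameter names are placeholders I have made up; the real ones need to be copied from the source of the validator’s submission form, as described above.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names: copy the real ones from the
# source of the validator's own submission form.
VALIDATOR_ENDPOINT = "https://validator.example/check"


def validator_url(document_uri, output="ntriples", want_images=False):
    """Build a GET URL asking a validator service to check document_uri."""
    params = {
        "uri": document_uri,                       # the page to be checked
        "output": output,                          # compact triples rather than a table
        "images": "yes" if want_images else "no",  # off by default, to spare the server
    }
    return VALIDATOR_ENDPOINT + "?" + urlencode(params)
```

A bookmarklet does the same thing with the address of whatever page the browser is currently showing, and then navigates to the resulting URL.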

Some more Favelets for the W3C Markup Validation Service are also available. If people find this one useful, I’ll see about getting it linked from that page and from the RDF validator itself.

Why call it “Check RDF syntax” instead of “validate RDF”, you might ask?

There’s a long answer and a short answer. I intended to keep this short, but failed. Basically, the word “validate” is terribly overloaded, particularly when we compare what it means in XML with the world of RDF. RDF users all too frequently ask for the ability to “validate” a document against an RDF schema, when what they often want is one of several things. Explaining the difference is not the easiest of things to do…

When someone wants to “validate RDF”, what could this mean? They may want to “check that it’s OK” against the rules of RDF/XML itself, or they may want to make sure — in some hard to articulate way — that it doesn’t violate anything said by the RDF schemas (aka “vocabularies”, “ontologies”, etc.) used in the document. Either way, they want some kind of sanity check, ie. “did I screw up?”.

Checking against the RDF/XML grammar is what I’m calling “RDF syntax checking”, and what W3C’s RDF Validation Service is all about. RDF syntax validation only makes sure that your XML is the right shape for sending RDF graphs around, eg. that you’ve used rdf:resource and rdf:about attributes in the right places, that the element nesting patterns are correct, etc. We might call this “RDF/XML-wellformed” by analogy with XML’s own notion of well-formedness. For each concrete RDF syntax, eg. RDF/XML, N3, N-Triples, XHTML2’s RDF/A etc., there would be a different version of syntax checking, to make sure that the document maps into RDF’s abstract graph structures.

The other concept of validation draws us into thinking about the nature of the Semantic Web, and about the differences between RDF and XML.

RDF documents (and their schemas) are about making simple claims about the world, structured in terms of classes (ie. categories) and relationships/properties. XML schemas (W3C XML Schema, DTDs, and others) do something similar, but they do it indirectly, by making statements about XML element structures for describing things in the world. The difference is subtle.

When an RDF schema has markup “defining” something like eg:ShippingAddress, it is talking about a class of thing-in-the-world. The RDF schema (or OWL ontology; OWL extends RDFS) expresses some generalisations about those things in the world that are shipping addresses.

When an XML schema has markup for eg:ShippingAddress it looks at first glance the same, when in fact something utterly different is going on. The XML schema is expressing some generalisations about XML document structures. It is telling us some rules for checking XML documents against the schema, expressed in terms of XML element containment, allowed attribute structures, and relationship to datatypes. It says “if your document has one of these here, and two of those there, and no thingybobs inside a such’n’so, then it is valid as far as I’m concerned”. In other words, it provides clear, machine-checkable rules for testing whether a document (or document-subsection) falls into some useful category. You can therefore validate an XML document against such schemas, to find out if you’ve missed some essential information, or if you’ve over-enthusiastically included more information than the schema-designer was expecting (“Hey, this is a eg:ShippingAddress, what’s that geo:lat, geo:long, photo:Image and foaf:aimChatID doing in there? Invalid!”).

RDF is not like that. You can’t easily do this kind of validation in RDF. RDF schemas don’t care what information you choose to include in some document, nor what other forms of information you mix it with. RDF is pretty mellow about all that, an attitude which can by turns be liberating and infuriating, depending on what you’re trying to do.

In RDF, missing isn’t broken. You can, from the point of view of an RDF schema or OWL ontology, always omit stuff from your RDF/XML documents. They don’t care, because RDF schemas express claims about the world, and not about the XML documents that describe that world. They say things like primaryAuthor is a relationship between a Document and an Agent; they don’t say anything about syntax, nor about how much information anybody ought to provide about documents, agents or their interrelationships. The authors of XML schemas get to say such things; authors of RDF schemas don’t. This is neither good nor bad, just different. It’s a difference grounded in the essential differences between RDF and XML.
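A toy contrast may help here. The sketch below is only an analogy in Python, not real XML Schema or RDFS processing: the field names and the single consistency rule are invented. The closed, XML-style check complains about both missing and unexpected fields; the open, RDF-style check accepts any amount of information and only objects to something that conflicts with what the schema says.

```python
# Closed, XML-schema-style check: the record must contain exactly the
# expected fields. The field names are a hypothetical address schema.
EXPECTED_FIELDS = {"street", "city", "postcode"}


def xml_style_valid(record):
    """Reject records with missing fields or with unexpected extra fields."""
    return set(record) == EXPECTED_FIELDS


def rdf_style_ok(record):
    """Open-world check: nothing is required and nothing extra is forbidden.

    The only (invented) rule is about consistency: if a postcode is given
    at all, it has to be a string.
    """
    return "postcode" not in record or isinstance(record["postcode"], str)
```

A record with only a city fails the first check and passes the second; so does a full address that also carries a geo:lat. That is the liberating half. The infuriating half is that the second check tells you very little about whether the record is any use to your application.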

Pedantic aside: are XML documents not “in the world” too? Indeed so. We could imagine an RDF/OWL ontology that defined classes called things such as “Element”, “Attribute”, and other core concepts from XML. In fact this work has been done already; see the RDF Schema for the XML Infoset. It hasn’t been updated to use OWL, though, so it doesn’t capture many of the core generalisations about XML document structure. See also an early XML schema proposal, Document Content Description for XML. This was “an RDF vocabulary designed for describing constraints to be applied to the structure and content of XML documents”. Note that those constraints were largely invisible to RDF itself; RDF in DCD was a carrier for information about XML documents and sub-classes of XML document, but generic RDF/OWL tools wouldn’t be able to interpret such descriptions and constraints to classify documents into XML-valid or XML-invalid.

Getting back on topic: RDF schemas make generalisations about the world, XML schemas make generalisations about XML document types. In this way, most XML schema languages can be thought of as a special-purpose “ontology language” optimised for dealing with the domain of XML documents.

How does this relate to validation?

The concept of validation in the XML world is all about checking whether some input is (a) well-formed XML and (b) structured according to the XML element/attribute rules of some XML schema. Does it have the right bits of information, in the right places, the right order? Are there any extra bits where there shouldn’t be? And so on.

The concept of validation in the RDF world is necessarily more permissive. First, just as in XML, we check basic syntax. In fact for XML-based RDF notations, like RDF/XML and XHTML2’s RDF/A, we do a lot of the same checking, since the same rules apply. But once we know we have an RDF graph, ie. a representation of a set of subject/predicate/object triples which make simple statements about the world, … what do we do next?

This is where the “liberating and infuriating” duality comes in. Freedom, horrible freedom. Once you’re past the “yup, it’s an RDF graph” stage, it isn’t always clear what kind of machine-checking to do next. The expectations we have from the world of XML schema (as well as from some OO-notions of class hierarchy modelling, and other sources) encourage us to look for a way to “validate” our RDF/XML documents against application schemas. So for example, we might think that we want to check a document against the RDF schema for Dublin Core, or FOAF, or RSS1, or MusicBrainz, Creative Commons, etc.

What could an RDF system do, to check that you’ve not screwed up when writing documents that use these schemas? Basically, all it can do is look for contradictions, ie. where you make statements about the world that simply couldn’t be so. It might, for example, remind you that you’re saying that a xyz:Document has a geo:lat of “52.1”, yet the domain of geo:lat is geo:SpatialThing, and xyz:Document and geo:SpatialThing are marked as owl:disjointWith each other. In other words, that what you’re saying doesn’t fit with the meaning of the terms given in the geo: and xyz: schemas. By ascribing a geo:lat to a document you are implicitly claiming that it is a spatial thing, yet the claims in the schema disagree with this, since they use RDF/OWL to claim that nothing can be both an xyz:Document and a geo:SpatialThing at the same time.
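The reasoning in that example is small enough to sketch by hand. The following Python is a toy, not an OWL reasoner: it knows only about rdfs:domain and owl:disjointWith, the schema facts are hard-coded from the paragraph above, and the ex:report resource is invented.

```python
# Schema claims, using the names from the example above.
DOMAIN = {"geo:lat": "geo:SpatialThing"}                      # rdfs:domain
DISJOINT = {frozenset(("xyz:Document", "geo:SpatialThing"))}  # owl:disjointWith


def find_contradictions(triples):
    """Return (subject, class_a, class_b) for subjects forced into two disjoint classes."""
    types = {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
        elif p in DOMAIN:
            # Using a property implies its subject belongs to the property's domain.
            types.setdefault(s, set()).add(DOMAIN[p])
    problems = []
    for s, classes in types.items():
        for pair in DISJOINT:
            if pair <= classes:
                a, b = sorted(pair)
                problems.append((s, a, b))
    return problems
```

Given a document typed as xyz:Document that also has a geo:lat, this reports the clash. Note that a document with no geo:lat at all is not a problem, and neither is a geo:lat on something of unknown type: the checker only objects when what was said cannot all be true together.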

This is a pretty simple example; things become more compelling as schemas get more complex. This sort of checking is useful in schema/ontology design, as much as for checking of instance documents: it is easy for a schema to embody a conceptual confusion. The community around W3C’s OWL ontology language have a lot of scientific know-how in this area - eg. checking huge, complex ontologies for mistakes. Think about the potential for error when reasoning about aircraft parts, or in the lifesciences.

There are some “OWL Validators” out there that can do this sort of checking. For example, see Mindswap’s Online OWL consistency checker, built using their Pellet reasoner. There’s also a similar validator at Manchester. BBN’s OWL validator is more concerned with checking the abstract (RDF-encoded) syntax of OWL, ie. it does not do full inference.

How useful is this sort of logical checking for simpler metadata applications? eg. RSS for data syndication, Dublin Core, FOAF etc. Well, it is a start. But it barely scratches the surface of what could be built on top of RDF. There are so many ways we can screw up our data, and only some of those are manifested as machine-checkable in the above manner. There are other forms of bad data than those that make logical errors.

For example, we could write (as many do), dc:author instead of dc:creator. That term isn’t defined by Dublin Core. Or we could spell a namespace wrong (I had http://xmlns.com/foaf/1.0/ instead of http://xmlns.com/foaf/0.1/ in the FOAF schema itself for a while; I found the mistake last week while using one of the OWL validators above). Checking for those kinds of errors is useful, and some RDF tools (eg. Cwm, Jena Eyeball) are increasingly offering more “lint”-style facilities. This trend is important, and a huge thing for RDF usability and deployment.
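A “lint” check of this kind needs no logic at all, just a list of the terms each vocabulary actually defines. Here is a minimal sketch; it is not how Cwm or Eyeball do it, and the term lists are a deliberately tiny hand-picked slice of each vocabulary, where a real tool would load the published schemas.

```python
# A small, hand-picked slice of each vocabulary, for illustration only.
KNOWN_TERMS = {
    "http://purl.org/dc/elements/1.1/": {"title", "creator", "date", "subject"},
    "http://xmlns.com/foaf/0.1/": {"Person", "name", "knows", "homepage"},
}


def lint(term_uris):
    """Warn about terms a known namespace does not define, and about unknown namespaces."""
    warnings = []
    for uri in term_uris:
        for ns, terms in KNOWN_TERMS.items():
            if uri.startswith(ns):
                local = uri[len(ns):]
                if local not in terms:
                    warnings.append(f"{uri}: '{local}' is not defined in {ns}")
                break
        else:
            # No known namespace matched; possibly a typo, possibly just new to us.
            warnings.append(f"{uri}: unrecognised namespace")
    return warnings
```

Both kinds of slip mentioned above show up: dc:author is flagged as undefined in Dublin Core, and a FOAF term under a mistyped namespace is flagged as coming from an unrecognised one. These are warnings rather than errors, since in an open world an unfamiliar namespace may simply be one the checker hasn’t met yet.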

There is, however, another form of “RDF checking” that deserves much more attention and research, and which I’ll even dare to claim, may prove critical to getting widespread adoption of Semantic Web technology in the public Web.

This is the idea of checking our RDF/XML documents against descriptive patterns that capture application-specific information needs. In the Dublin Core community, these are called “application profiles”. In the XML world, Rick Jelliffe’s excellent Schematron system has led the way. The idea, roughly, is that real-world applications often have information needs that are not expressed in schema definitions, since they are not shared by all users of the schema. This is a natural side effect of the admirable urge to use common schemas (whether XML or RDF) across the globe, as well as an acknowledgement that documents and data aren’t static, but part of a complex lifecycle in which different checks are appropriate in different environments. In the RDF world, such checking is also important, since we don’t have XML’s native concern for “missing” or “unexpected” chunks of data. We just have a graph, that is an unordered set of triples whose meaning is governed largely by schema-dictionary definitions of the property and class names used.
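As a rough illustration of what such a pattern check might look like over a bag of triples, here is a sketch in Python. The profile itself is hypothetical (one application’s demand that every foaf:Person it accepts comes with a name and a mailbox hash) and so are the ex: resources; the FOAF schema demands neither property.

```python
# A hypothetical profile for one application. These are the application's
# needs, not rules of the FOAF vocabulary.
PROFILE = {"foaf:Person": {"foaf:name", "foaf:mbox_sha1sum"}}


def check_profile(triples, profile=PROFILE):
    """Return {subject: missing properties} for subjects that fall short of the profile."""
    types, props = {}, {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
        props.setdefault(s, set()).add(p)
    report = {}
    for s, classes in types.items():
        for cls in classes:
            missing = profile.get(cls, set()) - props.get(s, set())
            if missing:
                report.setdefault(s, set()).update(missing)
    return report
```

The important design point is that the profile lives outside the schema. Another application can run a different profile over exactly the same graph, and a description that fails this one is still perfectly good RDF.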

As Edd Dumbill put it,

Processing RDF is therefore a matter of poking around in this graph. Once a program has read in some RDF, it has a ball of spaghetti on its hands. You may like to think of RDF in the same way as a hashtable data structure — you can stick whatever you want in there, in whatever order you want.

Such liberating flexibility! Semi-structured chaos! How on earth can programmers be expected to deal with it? What OO coder would replace their nicely organised structures with a hashtable?

I’m always amused when RDF and the Semantic Web are misrepresented as an exercise in formalistic centralisation, as promoting an ivory tower “one perfect ontology for the planet” and so on. True, we have our formalists, and I for one am eternally grateful to them for the huge amount of work the formal KR guys put into cleaning up W3C’s RDF specs. But RDF, if conceived of as a naive attempt to create a machine-readable theory of everything, is tragically misunderstood. RDF is a strategy for principled decentralisation in a world where unanticipated data re-use, unanticipated data extensions, are valued. It is anything but centralised.

RDF says, to schema authors: “don’t model how you think your data should be written in XML, model its assumptions about the world”. It says, “don’t tell me what can go inside a workplaceHomepage tag; tell me what kinds of things are related by the workplaceHomepage property”. And to the authors of RDF/XML documents it says, “I won’t pretend to know your information needs better than you; put any RDF statements into the graph that you think are meaningful, useful, affordable and shareable”.

For this to work, it takes away some choices from schema authors. There are all kinds of assumptions that can accompany XML schemas, which RDF takes away from the authors of RDF schemas. Most obviously, it imposes a particular syntax. In RDF/XML (one of many concrete RDF notations) we have all those rdf:about and rdf:resource attributes, alongside that sometimes-verbose “striped” notation. So schema authors, as they move from XML to RDF, give up their right to decide what XML tag structures to impose on their own users. They let RDF do that, and trust the RDF community to keep coming up with new and better notations that can be shared across all such schemas. Currently we have N3, the evolving XHTML2 RDF/A work and GRDDL as evidence that the RDF community take this responsibility seriously.

What else do schema authors give up? They give up their right to make rules couched in terms of data being missing versus present; defaults, for example. RDF doesn’t impose those kinds of restriction on the creators of RDF documents.

For example, take a look at SMBmeta from Dan Bricklin (great name!). The SMBmeta spec tells us:

The “country” is assumed to be “us” if omitted

For better or worse, you simply couldn’t say that in an RDF schema. To attempt to do so would be to misunderstand RDF entirely. RDF applications, given some other background knowledge, might in certain circumstances be justified in making that conclusion. But you can’t say “anybody who uses my schema to describe a Business, but omits the country code, is implicitly saying that the business is in the USA”.

Why? RDF is designed for open-world, data sharing apps, where content is syndicated, re-syndicated, merged, separated, shredded and glued back together. A single document might draw on a dozen different vocabularies, mixed tightly together at the elements and attributes level. It would be completely impractical for applications to have to read and understand the corresponding schemas, their idiosyncratic defaulting rules, and the interactions between those rules across schemas.

To make something like SMBmeta markup play nicely in the RDF world, we need to do more than simply analyse its XML structures and come up with a corresponding structure of RDF class and property names. We need to look deeper at the assumptions in the data, so that we don’t trample over the intended meaning of the XML when bringing it into the RDF environment. I took a look at making an SMBmeta file into RDF yesterday. More on that another time. The file got a bit uglier — the so-called RDF tax — to which it is tempting to say, “oh, we’ll just use a different RDF syntax”. But there’s a deeper issue, intimately bound up with the idea of document validity and data checking discussed here.

By moving a format to use RDF, we restrict the kinds of things a schema author can demand of his or her user. In RDF, you don’t get to say “if country is missing, assume USA”. This is why I consider GRDDL (the use of XSLT to go from non-RDF XML into RDF/XML) to be only part of the solution to the “Cambridge Communiqué” problem of better connecting RDF to XML. So whenever we try to RDFize some XML format, we have to look very carefully for these kinds of assumptions, since they don’t play well in the RDF data-mixing world. The Atom syndication format is another example I’ve been looking at. It turns out that we can construct a document that meets the syntactic rules of both Atom and RDF/XML, ie. there is a profile of Atom that could be used directly as RDF (see syntax check which will gripe about namespace prefixes). But before we rush around celebrating, we need to check the Atom spec very carefully, to see if the meaning of Atom terms makes sense in the RDF world (eg. defaulting rules: would RDF applications miss out on extra data? that’s bearable; would they misunderstand Atom data? that’s not).

What other constraints do we take away from XML schema authors? Here’s a big one: attaching meaning to XML element order. RDF graphs, like SQL relational tables, are unordered. This is because they correspond to logical assertions about the world. The importance of this, when trying to understand RDF, simply can’t be overestimated.

A couple of quick examples: in RSS 1.0, we wanted the RDF content to include an ordered list of the items described in the feed. Because RDF doesn’t preserve XML element ordering after you’ve parsed the document, we had to find a way of putting that ordering information into the graph. The design we chose was to have an rdf:Seq structure at the top of the document. More recently, I have this week been looking into the possibilities for reflecting GML into RDF. GML syntax looks a lot like RDF already, and I was able to make an example into parseable RDF very easily. However, GML often needs to describe ordered lists of points (eg. for polygons on a map). When I looked more closely at the RDFization design, I realised I had created meaningless data, since the RDF statements didn’t preserve the XML element ordering information that GML uses to link a set of points into a line. The following markup looks like RDF/XML, but doesn’t preserve enough information. There are other designs (eg. using RDF collections, or packing data into microsyntax-formatted strings) that need to be explored.


<Railway>
<gml:name>Track 29</gml:name>
<centerLine>
<gml:LineString>
<gml:pos>100 100</gml:pos>
<gml:pos>200 200</gml:pos>
<gml:pos>300 300</gml:pos>
</gml:LineString>
</centerLine>
</Railway>
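To make the problem concrete, here is a minimal Python sketch. It is not a GML or RDF parser: the graph is just a set of 3-part tuples, the names `seq_triples` and `read_seq` are made up for this illustration, and the numbered properties imitate the rdf:_1, rdf:_2 membership style of rdf:Seq. It shows that the naive mapping gives the same graph whatever order the points arrive in, while explicit numbering keeps the line intact.

```python
# Naive RDFization: each gml:pos becomes one (subject, property, value) statement.
# A graph is a set of statements, so document order is simply not recorded.
naive = {
    ("_:line", "gml:pos", "100 100"),
    ("_:line", "gml:pos", "200 200"),
    ("_:line", "gml:pos", "300 300"),
}
reordered = {
    ("_:line", "gml:pos", "300 300"),
    ("_:line", "gml:pos", "100 100"),
    ("_:line", "gml:pos", "200 200"),
}
same_graph = naive == reordered  # the two orderings are indistinguishable


def seq_triples(subject, values):
    """Hypothetical helper: number each member explicitly, rdf:Seq style."""
    return {(subject, f"rdf:_{i}", v) for i, v in enumerate(values, start=1)}


def read_seq(triples, subject):
    """Recover the ordered list from the numbered membership properties."""
    members = [
        (int(p[len("rdf:_"):]), o)
        for s, p, o in triples
        if s == subject and p.startswith("rdf:_")
    ]
    return [o for _, o in sorted(members)]


forward = seq_triples("_:line", ["100 100", "200 200", "300 300"])
backward = seq_triples("_:line", ["300 300", "200 200", "100 100"])
```

With the numbering in the graph, `forward` and `backward` are different graphs, and `read_seq` gives back the points in the order the line needs.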

The exercise of “RDFizing” an XML format consists of more than mapping from a tree-based data structure into an edge-labelled graph one. We also have to go through and find out how the original format deals with things like missing data, defaulting rules, and the use of XML element order to carry meaning. Once in the RDF world, all vocabularies behave the same; we take some flexibility and choice away from schema authors, so that end-user applications can enjoy a different kind of flexibility and choice.

It is important to RDF that element order information can (usually) be thrown away. This design allows us to grab data from multiple sources, and integrate it relatively cheaply. But it is also grounded in the workings of logic and human communication. Consider that we can say in English something like:

There’s a person whose name is “Dan Brickley”, whose birthday is January 9th, and who works for the organisation which has a homepage identified by the URI <http://www.w3.org/> and which has a title “World Wide Web Consortium”.

Instead of that, we could have said something a bit different, but which would be true or false of the world under exactly the same circumstances:

A thing with a title of “World Wide Web Consortium”, identified by the URI <http://www.w3.org/> is a homepage of an organisation that a person born on January 9th called “Dan Brickley” works for.

Machines are really pretty bad at natural language parsing, because they lack commonsense. They can’t understand the vast hidden context behind our utterances. RDF won’t fix that. Ever. It is a design for data sharing in a world where machines have these terrible limitations.

Sometimes it makes sense to adjust our human practices to make things easier on computers, when we want the computers to do things we’d rather not do ourselves. So for example, we simplify our handwriting, so our PDAs can recognise the letters in our handwriting. RDF is a simplification of the practice of making basic claims about the world. The above two English-prose paragraphs are true in exactly the same circumstances, and false in exactly the same circumstances, regardless of the re-ordering of the claims.

W3C went to great lengths to make sure that RDF shares these logical characteristics, since it is fundamental to data-sharing on a global scale (which is what W3C is all about). Let’s see the RDF version written in XML:


<Person xmlns="http://xmlns.com/foaf/0.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <name>Dan Brickley</name>
  <birthday>01-09</birthday>
  <workplaceHomepage>
    <Document rdf:about="http://www.w3.org/">
      <dc:title>World Wide Web Consortium</dc:title>
    </Document>
  </workplaceHomepage>
</Person>

Here’s another way we might have written those exact same 6 statements:


<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/dc/elements/1.1/"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Document rdf:about="http://www.w3.org/">
    <title>World Wide Web Consortium</title>
  </foaf:Document>
  <foaf:Person>
    <foaf:workplaceHomepage rdf:resource="http://www.w3.org/"/>
    <foaf:birthday>01-09</foaf:birthday>
    <foaf:name>Dan Brickley</foaf:name>
  </foaf:Person>
</rdf:RDF>

In both cases, RDF sees only the underlying 3-part statements. Six of them:

  • _:x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
  • _:x <http://xmlns.com/foaf/0.1/name> "Dan Brickley" .
  • _:x <http://xmlns.com/foaf/0.1/birthday> "01-09" .
  • <http://www.w3.org/> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document> .
  • <http://www.w3.org/> <http://purl.org/dc/elements/1.1/title> "World Wide Web Consortium" .
  • _:x <http://xmlns.com/foaf/0.1/workplaceHomepage> <http://www.w3.org/> .

The RDF Semantics spec explains the maths behind this far more carefully and cleverly than I could ever manage. But the idea is simple. Just as the two paragraphs of English prose are equivalent, regardless of statement ordering, so are the bits of RDF.
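A rough way to see this in code, as a sketch rather than anything an RDF toolkit actually does: write each document's statements as a Python list of tuples, in the order its XML happens to yield them, and compare the two as sets. I am assuming here that both documents use the same blank node label `_:x`; real parsers invent their own labels, so a proper comparison needs graph isomorphism rather than plain set equality. The sketch also doesn't distinguish literals from URIs.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
W3C = "http://www.w3.org/"

# Statements in the order the first XML document presents them.
first_doc = [
    ("_:x", RDF_TYPE, FOAF + "Person"),
    ("_:x", FOAF + "name", "Dan Brickley"),
    ("_:x", FOAF + "birthday", "01-09"),
    ("_:x", FOAF + "workplaceHomepage", W3C),
    (W3C, RDF_TYPE, FOAF + "Document"),
    (W3C, DC_TITLE, "World Wide Web Consortium"),
]

# The same statements in the order the second XML document presents them.
second_doc = [
    (W3C, RDF_TYPE, FOAF + "Document"),
    (W3C, DC_TITLE, "World Wide Web Consortium"),
    ("_:x", RDF_TYPE, FOAF + "Person"),
    ("_:x", FOAF + "workplaceHomepage", W3C),
    ("_:x", FOAF + "birthday", "01-09"),
    ("_:x", FOAF + "name", "Dan Brickley"),
]

same_sequence = first_doc == second_doc        # order-sensitive comparison
same_graph = set(first_doc) == set(second_doc)  # order-insensitive comparison
```

The list comparison notices the re-ordering; the set comparison, which is closer to how RDF sees things, does not.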

These RDF statements have no more intrinsic, meaningful order than the rows in a relational database, or the files in a directory on your computer. They might, of course, be sorted on various criteria. But the schemas used (in this case, Dublin Core and FOAF), like all RDF schemas, do not impose application-specific meaning on the ordering of the XML elements. To risk over-emphasising a point, neither do the creators of Dublin Core, or of FOAF, get to make up rules for what happens if data is missing from the graph. Data costs time and money to manage, and there are thousands of reasons why data might usefully be missing from an RDF graph. RDF pushes such concerns down into the application layer: if you have an application which wants the birthdays of all employees to be listed, that’s your own business. It is a separate problem from that of defining a shared markup for birthdays, or for representing employment.
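As a sketch of what such an application-layer rule might look like, here is a few lines of Python over a graph held as a set of tuples. The function name `people_missing`, the prefixed names and the sample data are all invented for this illustration; nothing like this rule is part of FOAF or of RDF.

```python
def people_missing(triples, prop):
    """Return the people in the graph that have no value for `prop`.

    This is a rule belonging to one application, not to the vocabulary:
    a graph that fails it is still perfectly good RDF.
    """
    people = {s for s, p, o in triples
              if p == "rdf:type" and o == "foaf:Person"}
    described = {s for s, p, o in triples if p == prop}
    return sorted(people - described)


# Invented sample data: one person with a birthday, one without.
graph = {
    ("_:a", "rdf:type", "foaf:Person"),
    ("_:a", "foaf:name", "Alice Example"),
    ("_:a", "foaf:birthday", "03-14"),
    ("_:b", "rdf:type", "foaf:Person"),
    ("_:b", "foaf:name", "Bob Example"),
}

incomplete = people_missing(graph, "foaf:birthday")
```

An application that insists on birthdays can complain about `_:b`; another application, reading the very same graph, may not care at all.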

So let’s look again at that graph of 6 statements, but as a diagram, since that emphasises the unordered nature of the data (follow link for full-size image):

[Image: 6 RDF statements in a graph]

As you can see, the RDF abstraction normalises both of the XML/RDF forms given above into a single structure. This is another of those “love it or hate it” features of RDF. Developers sometimes complain that RDF has too many ways of writing the same thing. Flipped around, this is a feature: RDF provides an account of what seemingly diverse descriptions have in common.

Consider an application that was trying to collect the birthdays of W3C team members, perhaps to output in iCalendar format, or to link to our Amazon wishlists and suchlike. The application doesn’t need to care about lots of things. It doesn’t care if it loads data from RDF/XML, or N3, or XHTML2/RDFA, or from a GRDDL-based XSLT transformation. All it cares about is scooping up some statements about people, their “workplace homepage”, and their names and birthdays. In this simple example, all of that information can be expressed using a single RDF vocabulary, FOAF. But it is essential to realise that the RDF application’s information needs could very easily have drawn upon other RDF vocabularies (eg. it might be looking for birthdays of people who work for W3C Member organizations, and who have contributed to W3C standards). FOAF isn’t enough to express that, but mixed in with other schemas, much of that information is available in the Semantic Web already.

Forget about validation, RDF checking, and constraints for a while. Think about querying… about how an RDF application would express a request for some chunks of data (whether using one schema, or several). W3C has a new RDF query language for this, called SPARQL. Here’s how we would use SPARQL to ask for W3C team birthdays. If using the 6-statement example, we won’t get many answers. The query would be:


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name ?bday
WHERE {
  ?p rdf:type foaf:Person .
  ?p foaf:name ?name .
  ?p foaf:birthday ?bday .
  ?p foaf:workplaceHomepage <http://www.w3.org/> .
}
ORDER BY ?name

The query describes a pattern to be matched against some RDF data, and is answered with a table of variable-to-value bindings, pretty much like SQL. The query looks a lot like an RDF graph, in fact, with “?” marking variables, ie. nodes in the graph where we don’t know the specific content of the node. Our application query here says “find me values for ?name and ?bday for anything (we’ll call it ‘p’) that has a type foaf:Person and a workplaceHomepage matching W3C’s homepage”.
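The “graph with variables” idea is simple enough to sketch in a few lines of Python. This is emphatically not a SPARQL engine: there is no parser, no DISTINCT, no ORDER BY, and the function names `match_pattern` and `query` are my own for this illustration. It only shows a list of triple patterns being matched against the six statements and answered with variable bindings.

```python
def is_var(term):
    """Query terms starting with '?' are variables."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must keep the same value everywhere it appears.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return extended


def query(patterns, triples):
    """Return every binding that satisfies all the patterns at once."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


W3C = "http://www.w3.org/"
graph = [
    ("_:x", "rdf:type", "foaf:Person"),
    ("_:x", "foaf:name", "Dan Brickley"),
    ("_:x", "foaf:birthday", "01-09"),
    ("_:x", "foaf:workplaceHomepage", W3C),
    (W3C, "rdf:type", "foaf:Document"),
    (W3C, "dc:title", "World Wide Web Consortium"),
]
patterns = [
    ("?p", "rdf:type", "foaf:Person"),
    ("?p", "foaf:name", "?name"),
    ("?p", "foaf:birthday", "?bday"),
    ("?p", "foaf:workplaceHomepage", W3C),
]

rows = [(b["?name"], b["?bday"]) for b in query(patterns, graph)]
```

Against the 6-statement example, `rows` holds a single answer, the name and birthday of the one person who works at the organisation with that homepage.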

I don’t intend to give a SPARQL tutorial here. The official spec is the best place to start learning (and please do send review comments!); there are also some (slightly out of date) tutorial materials online. You might also look at the RDFAuthor tutorial, since RDFAuthor serves as an authoring tool both for RDF itself, and for RDF queries (using Squish, one of the languages that fed into the SPARQL design). That tutorial makes one point quite clearly: RDF queries and RDF graphs are pretty similar structures. If you delve into the details of SPARQL, you’ll find some points at which it departs from the simple-minded world of RDF triples. It lets you ask queries, for example, where you say things about the source of the graph, as well as queries in which certain properties are marked as optional (eg. we might ask for photographic depictions of the people in the query, yet not want the query to fail if those bits of data weren’t available in the graph). Both of these features are useful and were expected when the SPARQL work began, but they move us away from the simple narrative of “RDF queries are just like RDF graphs with bits of graph marked as missing”. As a quick intro to RDF querying with SPARQL, that concept is worth hanging onto. And it also helps us think about the relationship between RDF query and RDF “validation” or data checking.

I mentioned the Schematron system earlier. Schematron is useful when thinking about how new kinds of RDF checking and validation might work. It is built on top of the XPath spec. By testing to see whether specific XPath addresses match against some target document, we can probe the contents to see whether they meet our application needs or not. The tagline on the Schematron site says it all: Schematron is a language for making assertions about patterns found in XML documents. A few words from the overview page, which describes how you can develop and mix two kinds of schemas:

  • Report elements allow you to diagnose which variant of a language you are dealing with.
  • Assert elements allow you to confirm that the document conforms to a particular schema.

The approach is pretty simple:

  • First, find context nodes in the document (typically elements) based on XPath path criteria;
  • Then, check to see if some other XPath expressions are true, for each of those nodes.
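Those two steps can be imitated in a few lines of Python. This is only a sketch of the idea, not Schematron itself: the rule format (context path, test path, message), the `check` function and the sample document are invented here, and the standard library’s ElementTree understands only a small subset of XPath.

```python
import xml.etree.ElementTree as ET

# Invented sample document: the second item has no link.
DOC = """<feed>
  <item><title>First post</title><link>http://example.org/1</link></item>
  <item><title>Second post</title></item>
</feed>"""

# Hypothetical rule format: (context path, test path, message if the test fails).
RULES = [
    (".//item", "title", "an item should have a title"),
    (".//item", "link", "an item should have a link"),
]


def check(root, rules):
    """Find each context node, then test the assertion relative to it."""
    failures = []
    for context_path, test_path, message in rules:
        for index, node in enumerate(root.findall(context_path), start=1):
            # find() returns None when nothing matches the test path.
            if node.find(test_path) is None:
                failures.append((context_path, index, message))
    return failures


failures = check(ET.fromstring(DOC), RULES)
```

Here `failures` reports the second item for lacking a link, and nothing else.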

That’s it, really. Schematron schemas allow you to check XML documents against rules that go beyond those that come with the elements and attributes used by the document. You might, for