Quick clarification on SPARQL extensions and “Lock-in”

It’s clear from discussion bouncing around IRC, Twitter, Skype and elsewhere that “Lock-in” isn’t a phrase to use lightly.

So I post this to make myself absolutely clear. A few days ago I mentioned in IRC a concern that newcomers to SPARQL and RDF databases might not appreciate which SPARQL extensions are widely implemented, and which are the specialist offerings of the system they happen to be using. I mentioned OpenLink’s Virtuoso in particular as a SPARQL implementation that had a rich and powerful set of extensions.

Since it seems there is some risk I might be misinterpreted as suggesting OpenLink are actively trying to “do a Microsoft” and trap users in some proprietary pseudo-SPARQL, I’ll state what I took to be obvious background knowledge: OpenLink is a company that owes its success to the promotion of cross-vendor database portability; they have been tireless advocates of a standards-based Semantic Web, and they’re active in proposing extensions to the W3C for standardisation. So – no criticism of OpenLink intended. None at all.

All I think we need here are a few utilities that help developers understand the nature of the various SPARQL dialects and the potential costs/benefits of using them. Perhaps an online validator, alongside those for RDF/XML, RDFa, Turtle etc. Such a validator might usefully list the extensions used in a query, and give pointers (perhaps into a wiki) where the status of the various extension constructs can be discussed and documented.
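
To sketch what that might look like, here is a rough illustration (in Java) of the kind of check such a validator could perform: scan a query for tokens that aren’t part of the W3C specification. The keyword list here is invented for the example; a real tool would need a community-maintained registry, which is where the wiki would come in.

import java.util.*;
import java.util.regex.*;

// Toy sketch of a "dialect checker": scan a SPARQL query for tokens that
// are vendor extensions rather than part of the W3C specification.
// The keyword list below is illustrative, not authoritative.
public class DialectCheck {
    static final Map<String, String> EXTENSIONS = new HashMap<String, String>();
    static {
        EXTENSIONS.put("bif:contains", "Virtuoso free-text extension");
        EXTENSIONS.put("LET", "assignment extension (pre-standard)");
    }

    public static List<String> extensionsUsed(String query) {
        List<String> found = new ArrayList<String>();
        for (Map.Entry<String, String> e : EXTENSIONS.entrySet()) {
            if (Pattern.compile("\\b" + Pattern.quote(e.getKey()) + "\\b",
                    Pattern.CASE_INSENSITIVE).matcher(query).find()) {
                found.add(e.getKey() + " -- " + e.getValue());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String q = "SELECT ?doc WHERE { ?doc ?p ?o . ?o bif:contains 'sparql' }";
        System.out.println(extensionsUsed(q)); // flags the Virtuoso-specific bif:contains
    }
}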

Since SPARQL is such a young language, it lacks a lot of things that are taken for granted in the SQL world, so using rich custom extensions when available is, for many developers, a sensible choice. My only concern is that it must be a choice, and one entered into consciously.

OpenSocial schema extraction: via Javascript to RDF/OWL

OpenSocial’s API reference describes a number of classes (‘Person’, ‘Name’, ‘Email’, ‘Phone’, ‘Url’, ‘Organization’, ‘Address’, ‘Message’, ‘Activity’, ‘MediaItem’, …), each of which has various properties whose values are either strings, references to instances of other classes, or enumerations. I’d like to make them usable beyond the confines of OpenSocial, so I’m making an RDF/OWL version. OpenSocial’s schema is an attempt to provide an overarching model for much of present-day mainstream ‘social networking’ functionality, including dating, jobs etc. Such a broad effort is inevitably somewhat open-ended, and so may benefit from being linked to data from other complementary sources.

With a bit of help from the shindig-dev list, #opensocial IRC, and Kevin Brown and Kevin Marks, I’ve tracked down the source files used to represent OpenSocial’s data schemas: they’re in the opensocial-resources SVN repository on code.google.com. There is also a downstream copy in the Apache Shindig SVN repo (I’m not very clear on how versioning and evolution is managed between the two). They’re Javascript files, structured so that documentation can be generated via javadoc. The Shindig-PHP schema diagram I posted recently is a representation of this schema.

So – my RDF version. At the moment it is merely a list of classes and their properties (expressed via rdfs:domain), written using RDFa/HTML. I don’t yet define rdfs:range for any of these, nor handle the enumerated values (opensocial.Enum.Smoker, opensocial.Enum.Drinker, opensocial.Enum.Gender, opensocial.Enum.LookingFor, opensocial.Enum.Presence) that are defined in enum.js.
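
For a flavour of what this amounts to in RDF terms, here’s a minimal hand-written fragment using the Jena API (Java, rather than the RDFa/HTML of the actual files; the namespace URI below is a placeholder, not the vocabulary’s real URI):

import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.vocabulary.*;

// Minimal sketch: one OpenSocial-style class plus a property given an
// rdfs:domain, mirroring the shape of what schemarama.js emits.
// The namespace URI here is a placeholder for illustration.
public class MiniSchema {
    public static void main(String[] args) {
        String ns = "http://example.org/opensocial-vocab#";
        Model m = ModelFactory.createDefaultModel();
        Resource person = m.createResource(ns + "Person");
        person.addProperty(RDF.type, OWL.Class);
        Property phone = m.createProperty(ns + "phone");
        phone.addProperty(RDF.type, RDF.Property);
        phone.addProperty(RDFS.domain, person);
        // rdfs:range deliberately left out, matching the current state above
        m.write(System.out, "RDF/XML-ABBREV");
    }
}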

The code is all in the FOAF SVN, and accessible via “svn co http://svn.foaf-project.org/foaftown/opensocial/vocab/”. I’ve also taken the liberty of including a copy of the OpenSocial *.js files, and Mozilla’s Rhino Javascript interpreter js.jar in there too, for self-containedness.

The code in schemarama.js will simply generate an RDFa/XHTML page describing the schema. This can be checked using the W3C validator, or converted to RDF/XML with the pyRDFa service at W3C.

I’ve tested the output using the OwlSight/pellet service from Clark & Parsia, and with Protege 4. It’s basic but seems OK and a foundation to build from. Here’s a screenshot of the output loaded into Protege (which btw finds 10 classes and 99 properties).

An example view from Protege, showing the class browser in one panel and a few properties of Person in another.

OK so why might this be interesting?

  • Using OpenSocial-derived vocabulary and OpenSocial-exported data in other contexts
    • databases (queryable via SPARQL)
    • mixed with FOAF
    • mixed with Microformats
    • published directly in RDFa/HTML
  • Mapping OpenSocial terms to other contact and social network schemas

This suggests some goals for continued exploration:

It should be possible to use “OpenSocial markup” in an ordinary homepage or blog (HTML or XHTML), drawing on any of the descriptive concepts they define, through using RDFa’s markup notation. As Mark Birbeck pointed out recently, RDFa is an empty vessel – it does not define any descriptive vocabulary. Instead, the RDF toolset offers an environment in which vocabulary from multiple independent sources can be mixed and merged quite freely. The hard work of the OpenSocial team in analysing social network schemas and finding commonalities, or of the Microformats scene in defining simple building-block vocabularies … these can hopefully be combined within a single environment.
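
As a small illustration of that mixing (shown here with the Jena API in Java, since the underlying RDF graph is the same whether statements arrive via RDFa or anything else; the OpenSocial-style namespace and property below are placeholders):

import com.hp.hpl.jena.rdf.model.*;

// Sketch: one resource described with terms from two independent
// vocabularies. FOAF's namespace is real; the OpenSocial-style
// namespace is a placeholder for illustration.
public class MixedVocab {
    public static void main(String[] args) {
        String foaf = "http://xmlns.com/foaf/0.1/";
        String os = "http://example.org/opensocial-vocab#"; // placeholder
        Model m = ModelFactory.createDefaultModel();
        Resource me = m.createResource("http://danbri.org/foaf.rdf#danbri");
        me.addProperty(m.createProperty(foaf + "name"), "Dan Brickley");
        me.addProperty(m.createProperty(os + "jobInterests"), "semweb hacking");
        m.write(System.out, "N-TRIPLE");
    }
}

Both statements end up in a single graph, queryable together, without either vocabulary’s authors ever having coordinated.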

SQL, OpenOffice: would a JDBC driver for SPARQL protocol make sense?

I’d like to have access to my SPARQL data from desktop tools like OpenOffice’s spreadsheet. Apparently it can talk to remote databases via JDBC and ODBC, both for the spreadsheet and database tools within OpenOffice.

Many years ago, pre-SPARQL, I hassled Libby into making her Squish RDF query system talk JDBC. There have been related efforts since, but I’ve lost track of what’s currently feasible with modern tools. There was a SPARQL4J collaboration initially between Asemantics, Profium and HP Labs, and some JDBC code from 2005 is on sourceforge. SPARQL inside JDBC is always going to be a bit of a hack, I suspect, since SPARQL results aren’t exactly like SQL results. But if it’s the easiest way to get things into a spreadsheet, it’s probably worth pursuing. I suspect this is the sort of problem OpenLink are good at, given their database driver background.

Would it be possible to wire up OpenOffice somehow so that it thought it was talking JDBC, but the connection parameters told the Java driver about a SPARQL endpoint URL? Actually, all this is a bit predicated on the assumption that OpenOffice allows the app user to pass in SQL, so we could pass SPARQL instead. If all the SQL is generated by OpenOffice, we’d need to hack the OO src. Am I making any sense? What still needs to be done? Where are we today?
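
To sketch the shape of what I’m imagining (pure speculation: the driver class name and the “jdbc:sparql:” URL scheme below are invented, and no such driver is assumed to exist in this form), the client-side experience might be:

import java.sql.*;

// Speculative sketch of how SPARQL-over-JDBC might feel to a client.
// The driver class and "jdbc:sparql:" URL scheme are invented here.
public class SparqlOverJdbc {
    public static void main(String[] args) throws Exception {
        Class.forName("org.example.sparql.jdbc.Driver"); // hypothetical driver
        Connection con = DriverManager.getConnection(
                "jdbc:sparql:http://example.org/sparql"); // endpoint URL as connection parameter
        Statement st = con.createStatement();
        // SPARQL text travels where SQL normally would; each variable
        // becomes a column, each query solution a row.
        ResultSet rs = st.executeQuery(
                "SELECT ?name ?mbox WHERE { " +
                " ?x <http://xmlns.com/foaf/0.1/name> ?name . " +
                " ?x <http://xmlns.com/foaf/0.1/mbox> ?mbox }");
        while (rs.next()) {
            System.out.println(rs.getString("name") + " " + rs.getString("mbox"));
        }
        con.close();
    }
}

One wrinkle: OPTIONAL patterns mean a SPARQL binding can simply be absent, which maps tolerably to SQL NULL, but it’s exactly the sort of mismatch that makes this a hack rather than a clean fit.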

“The World is now closed”

Facebook in many ways is pretty open for a ‘social networking’ site. It gives extension apps a good amount of access to both data and UI. But the closed world language employed in their UI betrays the immodest assumption “Facebook knows all”.

  • Eric Childress and Stuart Weibel are now friends with Charles McCathienevile.
  • John Doe is now in a relationship.
  • You have 210 friends.

To state the obvious: maybe Eric, Stu and Chaals were already friends. Maybe Facebook was the last to know about John’s relationship; maybe friendship isn’t countable. As the walls between social networking sites slowly melt (I put Jabber/XMPP first here, with OpenID, FOAF, SPARQL and XFN as helper apps), me and my 210 closest friends will share fragments of our lives with a wide variety of sites. If we choose to make those descriptions linkable, the linked sites will increasingly need to refine their UI text to be a little more modest: even the biggest site doesn’t get the full story.

Closed World Assumption (Abort/Retry/Fail)
Facebook are far from alone in this (see this Xbox screenshot too, “You do not have any friends!”); but even with 35M users, the mistake is jarring, and not just to Semantic Web geeks of the missing-isn’t-broken school. It’s simply a mistake to fail to distinguish the world from its description, or the territory from the map.

A description of me and my friends hosted by a big Web site isn’t “my social network”. Those sites are just databases containing claims made by different people, some verified, some not. And with, inevitably, lots missing. My “social network” is an abstractification of a set of interlinked real-world histories. You could make the case that there has only ever been one “social network” since the distant beginnings of human society; certainly those who try to do genealogy with Web data formats run into this in a weaker form, including the need to balance competing and partial information. We can do better than categorised “buddylists” when describing people, their inter-connections and relationships. And in many ways Facebook is doing just great here. Aside from the Pirates-vs-Ninjas noise, many extension applications on Facebook allow arbitrary events from elsewhere in the Web to bubble up through their service and be seen (or filtered) by others who are linked to me in their database.

Facebook is good at reporting events, generally, especially those sourced outside the system. Where it isn’t so great is when reporting internal events, e.g. someone telling it about a relationship. Event descriptions are nice things to syndicate, btw, since they never go out of date. Syndicating descriptions of the changeable properties of the world, on the other hand, is more slippery, since you need all the other relevant facts to be able to say how the world is right now (or, implicitly, how it used to be). “Dan has painted his car red” versus “Dan’s car is now red”. “Dan has bookmarked the Jabber user profile spec” versus “Dan now has 1621 bookmarks”. “Dan has added Charles to his Facebook profile” versus “Dan is now friends with Charles”.

We need better UI that reflects what’s really going on. There will be users who choose to live much of their lives in public view, spread across sites, sharing enough information for these accounts to be linked. Hopefully they’ll be as privacy-smart and selective as Pew suggests. Personas and ‘characters’ can be spread across sites without either site necessarily revealing a real-world identity; secrets are keepable, at least in theory. But we will see people’s behaviour and claims from one site leak into another, and with approval. I don’t think this will be just through some giant “social graph” of strictly enumerated relationships, but through a haze of vaguer data.

What we’re most missing is a style of end-user UI here that educates users about this world that spans websites, couching things in terms of claims hosted in sites, rather than in absolutist terms. I suppose I probably don’t have 210 “friends” (whatever that means) in real life, although I know a lot of great people and am happy to be linked to them online. But I have 210 entries in a Facebook-hosted database. My email whitelist file has 8785 email addresses in it currently; email accounts that I’m prepared to assume aren’t sending me spam. I’m sure I can’t have 8785 friends. My Google Mail (and hence GTalk Jabber) account claims 682 contacts, and has some mysterious relationship to my Orkut account where I have 200+ (more randomly selected) friends. And now the OpenID roster on my blog gives another list (as of today, 19 OpenIDs that made it past the WordPress spam filter). Modern social websites shouldn’t try to tell me how many friends I have; that’s just silly. And they shouldn’t assume their database knows it all. What they can do is try to tell me things that are interesting to me, with some emphasis on things that touch my immediate world and the extended world of those I’m variously connected to.

So what am I getting at here? I guess it’s just that we need these big social sites to move away from making teen-talk claims about how the world is – “Sally (now) loves John” – and instead become reflectors for the things people are saying, “Sally announces that she’s in love with John”; “John says that he used to work for Microsoft” versus “John worked for Microsoft 2004-2006”; “Stanford University says Sally was awarded a PhD in 2008”. Today’s young internet users are growing up fast, and the Web around them needs also to mature.

One of the most puzzling criticisms you’ll often hear about the Semantic Web initiative is that it requires a single universal truth, a monolithic ontology to model all of human knowledge. Those of us in the SW community know that this isn’t so; we’ve been saying for a long time that our (meta)data architecture is designed to allow people to publish claims “in which statements can draw upon multiple vocabularies that are managed in a decentralised fashion by various communities of expertise.” As the SemWeb technology stack now has a much better approach to representing data provenance (SPARQL named graphs replacing RDF’99 statement reification), I believe we should now be putting more emphasis on a related theme: Semantic Web data can represent disputes, competing claims, and contradictions. And we can query it in an SQL-like language (SPARQL) that allows us to ask questions not just of some all-knowing database, but about what different databases are telling us.
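
For instance, here’s a minimal sketch (using Jena’s ARQ, in Java; the graph URIs and data are invented) of the “according to whom?” style of query:

import com.hp.hpl.jena.query.*;
import com.hp.hpl.jena.rdf.model.*;

// Sketch: two named graphs make competing claims about the same fact,
// and a SPARQL query asks not just "what is said" but "who says it".
// Graph URIs and data are invented for illustration.
public class WhoSaidWhat {
    public static void main(String[] args) {
        Model stanford = ModelFactory.createDefaultModel();
        stanford.createResource("http://example.org/sally").addProperty(
                stanford.createProperty("http://example.org/phdYear"), "2008");
        Model rumour = ModelFactory.createDefaultModel();
        rumour.createResource("http://example.org/sally").addProperty(
                rumour.createProperty("http://example.org/phdYear"), "1995");

        DataSource ds = DatasetFactory.create();
        ds.addNamedModel("http://stanford.example.org/claims", stanford);
        ds.addNamedModel("http://gossip.example.org/claims", rumour);

        String q = "SELECT ?source ?year WHERE { GRAPH ?source { " +
                   " <http://example.org/sally> <http://example.org/phdYear> ?year } }";
        ResultSet rs = QueryExecutionFactory.create(q, ds).execSelect();
        while (rs.hasNext()) {
            QuerySolution s = rs.nextSolution();
            System.out.println(s.get("source") + " says " + s.get("year"));
        }
    }
}

The same triple pattern, asked across named graphs, returns each source alongside its claim rather than pretending there is one answer.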

The closed world approach to data gives us a lot, don’t get me wrong. I’m not the only one with a love-hate relationship with SQL. There are many optimisations we can do in a traditional SQL or XML Schema environment which become hard in an RDF context. In particular, going “open world” makes for a harder job when hosting and managing data rather than merely aggregating and integrating it. Nevertheless, if you’re looking for a modern Web data environment for aggregating claims of the “Stanford University says Sally was awarded a PhD in 1995” form, SPARQL has a lot to offer.

When we’re querying a single, all-knowing, all-trusted database, SQL will do the job (see Facebook’s FQL, for example). When we need to take a bit more care with the “who said what” and “according to whom?” aspects, coupled with schema extensibility and frequently missing data, SQL starts to hurt. If we’re aggregating (and building UI for) ‘social web’ claims about the world rather than simple buddylists (which XMPP/Jabber gives us out of the box), I suspect aggregators will get burned unless they take care to keep careful track of who said what, whether using SPARQL or some home-grown database system in the same spirit. And I think they’ll find that doing so will be peculiarly rewarding, giving us a foundation for applications that do substantially more than merely listing your buddies…

OpenID plugin for WordPress

I’ve just installed Alan J Castonguay’s WordPress OpenID plugin on my blog, part of a cleanup that included nuking 11000+ comments in the moderation queue using the Spam Karma 2 plugin. Apologies if I zapped any real comments too. There are a few left, at least!

The OpenID thing appears to “just work”. By which I mean, I could log in via it and leave a comment. I’d be super-grateful if those of you with OpenIDs could take a minute to leave a comment on this post, to see if it works as well as it seems to. If it doesn’t, a bug report (to danbrickley@gmail.com) would be much appreciated. Those of you with LiveJournals or AOL/AIM accounts already have OpenID, even if you didn’t notice. See the HTML source for my homepage to see how I use “danbri.org” as an OpenID while delegating the hard work to LiveJournal. For more on OpenID, check out these tutorial slides (flash/pdf) from Simon Willison and David Recordon.

Thinking about OpenID-mediated blog comments, the tempting thing then would be to do something with the accumulated URIs. The plugin keeps its data in nice SQL tables, presumably accessible to other WordPress plugins. It’s been a while since I made a WordPress plugin, but they seem to have a pretty good framework accessible to them now.

mysql> select user_id, url from wp_openid_identities;
+---------+--------------------+
| user_id | url                |
+---------+--------------------+
|      46 | http://danbri.org/ |
+---------+--------------------+
1 row in set (0.28 sec)

At the moment, it’s just me. It’d be fun to try scooping up RDF (FOAF, SKOS, SIOC, feeds…) from any OpenID URIs that accumulate there. Hmm I even wrote up that project idea a while back – SparqlPress. At the time I tried prototyping it in Redland + PHP, but nowadays I’d probably use Benjamin Nowack’s ARC library, which provides SPARQL query of a MySQL-backed RDF store, and is written in PHP. This gives it the same dependencies as WordPress, making it ideal for pluginization. If anyone’s looking for a modest-sized practical SemWeb project to hack on, that one could be a lot of fun.

There’s a lot of interesting and creative fuss about “social networking” site interop around lately, largely thanks to the social graph paper from Brad Fitzpatrick and David Recordon. I lean towards the “show me, don’t tell me” approach regarding buddylists and suchlike (as does Julian Bond with Ecademy), which is why FOAF has only ever had the mild-mannered “knows” relationship in the core vocabulary, rather than trying to over-formalise “bestest friend EVER” and other teenisms. So what I like about this WordPress plugin is that it gives some evidence-based raw material for decentralised social networking apps. Blog comments don’t tell the whole story; nothing tells the whole story. But rather than maintain a FOAF “knows” list (or blogroll, or blog-reader config) by hand, I’d prefer to be able to partially automate it by querying information about whose blogs I’ve commented on, and vice-versa. There’s so much that could be built, intimidatingly much, that it’s hard to know where to start. I suggest that everyone in the SemWeb scene having an OpenID with a FOAF file linked from it would be an interesting platform from which to start exploring…

Meanwhile, I’ll try generating an RDF blogroll from any URIs that show up in my OpenID WordPress table, so I can generate a planetplanet or chumpologica configuration automatically…
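
A first cut could be as simple as the following sketch (Java; the table name comes from the listing above, while the connection details and the choice of rdfs:seeAlso output are illustrative guesses):

import java.sql.*;
import com.hp.hpl.jena.rdf.model.*;

// Sketch: read OpenID URLs from the wp_openid_identities table shown
// earlier and emit rdfs:seeAlso pointers from a blogroll resource to
// each URL. Connection details and URIs are illustrative only.
public class OpenIdBlogroll {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/wordpress", "wp", "secret"); // placeholder credentials
        ResultSet rs = con.createStatement()
                .executeQuery("SELECT url FROM wp_openid_identities");

        Model m = ModelFactory.createDefaultModel();
        Property seeAlso = m.createProperty(
                "http://www.w3.org/2000/01/rdf-schema#seeAlso");
        Resource blogroll = m.createResource("http://danbri.org/blogroll"); // placeholder URI
        while (rs.next()) {
            // Each commenter's OpenID URL becomes a seeAlso pointer; a
            // planetplanet config could be generated from the same loop.
            blogroll.addProperty(seeAlso, m.createResource(rs.getString("url")));
        }
        con.close();
        m.write(System.out, "RDF/XML-ABBREV");
    }
}

Mapping the OpenID URLs to actual people (foaf:knows) would need more care, since a URL identifies a page rather than a person; seeAlso pointers are a safer first step for feeding a planet aggregator.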