JQbus: social graph query with XMPP/SPARQL

Righto, it's about time I wrote this one up. One of my last deeds at W3C before leaving at the end of 2005 was to begin the specification of an XMPP binding of the SPARQL querying protocol. For the acronym-averse, a quick recap. XMPP is the name the IETF gives to the Jabber messaging technology. And SPARQL is W3C's RDF-based approach to querying mixed-up Web data. SPARQL defines a textual query language, an XML result-set format, and a JSON version for good measure. There is also a protocol for interacting with SPARQL databases; this defines an abstract interface, and a binding to HTTP. There is as yet no official binding to XMPP/Jabber, and the existing explorations are flawed. But as I'll argue here, the work is well worth completing.

jqbus diagram

So what do we have so far? Back in 2005, I was working in Java, Chris Schmidt in Python, and Steve Harris in Perl. Chris has a nice writeup of one of the original variants I'd proposed, which came out of my discussions with Peter Saint-Andre. Chris also beat me in the race to have a working implementation, though I'll attribute this to Python's advantages over Java ;)

I won't get bogged down in the protocol details here, except to note that Peter advised us to use IQ stanzas, that the existing work offers a few slight variants on the idea of sending a SPARQL query in one IQ packet and returning all the results within another, and that this isn't quite deployable as-is. When the result set is too big, we can run into practical (rather than spec-mandated) limits at the server-to-server layer. For example, Peter mentioned that jabber.org had a 65k packet limit. At SGFoo last week, someone suggested sending the results as an attachment instead; apparently this is one of the uncountably many extension specs produced by the energetic Jabber community. The 2005 work was also somewhat partial, and didn't work out the detail of a full binding (eg. dealing with default graphs, named graphs etc).
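
To make the shape of the exchange concrete, here is a rough sketch of what a query and its response might look like as IQ stanzas. This is purely illustrative: there is no agreed namespace or element name yet, so 'urn:example:xmpp-sparql' and the JIDs are placeholders I've made up; only the result payload, which reuses the standard SPARQL XML results format, is real.

<iq type='get' from='alice@example.org/home' to='store@example.org/sparql' id='q1'>
  <query xmlns='urn:example:xmpp-sparql'><![CDATA[
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name WHERE { ?x foaf:name ?name } LIMIT 5
  ]]></query>
</iq>

<iq type='result' from='store@example.org/sparql' to='alice@example.org/home' id='q1'>
  <sparql xmlns='http://www.w3.org/2005/sparql-results#'>
    <head><variable name='name'/></head>
    <results>
      <result><binding name='name'><literal>Dan Brickley</literal></binding></result>
    </results>
  </sparql>
</iq>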

That said, I think we're onto something good. I'll talk through the Java stuff I worked on, since I know it best. The code uses Ignite Realtime's Smack API. I have published rough Java code that can communicate with instances of itself across Jabber. This was last updated July 2007, when I fixed it up to use more recent versions of Smack and Jena. I forget whether the code to parse out query results from the responses was completed, but it does at least send SPARQL XML results back through the XMPP network.

sparql jabber interaction

So why is this interesting?

  • SPARQLing over XMPP can cut through firewalls/NAT and talk to data on the desktop
  • SPARQLing over XMPP happens in a social environment; queries are sent from and to Jabber addresses, while roster information is available which could be used for access control at various levels of granularity
  • XMPP is well suited for async interaction; stored queries could return results days or weeks later (eg. job search)
  • The DISO project is integrating some PHP XMPP code with WordPress; SparqlPress is doing the same with SPARQL

Both SPARQL and XMPP have mechanisms for batched results; it isn't clear which, if either, to use here.

XMPP also has some service discovery mechanisms; I hope we’ll wire up a way to inspect each party on a buddylist roster, to see who has SPARQL data available. I made a diagram of this last summer, but no code to go with it yet. There is also much work yet to do on access control systems for SPARQL data, eg. using oauth. It is far from clear how to integrate SPARQL-wide ideas on that with the specific possibilities offered within an XMPP binding. One idea is for SPARQL-defined FOAF groups to be used to manage buddylist rosters, “friend groups”.
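
For the service discovery piece, XEP-0030 (disco#info) already gives us the machinery; what's missing is an agreed feature name to advertise. A sketch, with the SPARQL feature URI made up for illustration:

<iq type='get' from='alice@example.org/home' to='bob@example.org/laptop' id='disco1'>
  <query xmlns='http://jabber.org/protocol/disco#info'/>
</iq>

<iq type='result' from='bob@example.org/laptop' to='alice@example.org/home' id='disco1'>
  <query xmlns='http://jabber.org/protocol/disco#info'>
    <identity category='client' type='pc'/>
    <feature var='http://jabber.org/protocol/disco#info'/>
    <feature var='urn:example:sparql-query'/>
  </query>
</iq>

Roster plus disco would let a client walk down the buddylist asking each online resource whether it speaks SPARQL, which is all the "who has data?" part of the diagram really needs.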

Where are we with code? I have a longer page in FOAF SVN for the Jqbus Java stuff, and a variant on this writeup (includes better alt text for the images and more detail). The Java code is available for download. Chris’s Python code is still up on his site. I doubt any of these can currently talk to each other properly, but they show how to deal with XMPP in different languages, which is useful in itself. For Perl people, I’ve uploaded a copy of Steve’s code.

The Java stuff has a nice GUI for debugging, thanks to Smack. I just tried a copy. Basically I can run a client and a server instance from the same filetree, passing in my LiveJournal and Google Talk jabber account details. The screenshot here shows the client on the left displaying the XML-encoded SPARQL results, and the server on the right displaying the query that arrived. That's about it really. Nothing else ought to be that different from normal SPARQL querying, except that it is being done in an infrastructure that is more socially-grounded and decentralised than the classic HTTP Web service model.

JQbus debug

Small databases, loosely joined

Over the last month or so, I've had a SPARQL store (just an ARC instance plus a loader script) running on the Amazon EC2 sandbox I set up for FOAF experiments. Yesterday I installed a fresh SparqlPress bundle into my own blog, which runs on another server. So how to get the data across? Since SparqlPress includes a scutter (which, despite Urban Dictionary, is just the FOAF term for an RDF crawler), we can set it to crawl a Web of linked RDF. Here's a SPARQL query on the former installation which sets up some pointers to each graph (scutters often follow rdfs:seeAlso, amongst other techniques). It's interesting too because it summarises the data found in each graph, simply by listing its properties.

CONSTRUCT {
<http://danbri.org/foaf.rdf#danbri> <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?g .
?g <http://purl.org/net/scutter/vocabItem> ?p .
}
WHERE {
GRAPH ?g { ?s ?p ?o }
}
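
The output is a little graph of this shape (the graph URI and the listed property are just illustrative placeholders for whatever is actually in the store):

<http://danbri.org/foaf.rdf#danbri> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://example.org/somegraph.rdf> .
<http://example.org/somegraph.rdf> <http://purl.org/net/scutter/vocabItem> <http://xmlns.com/foaf/0.1/name> .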

The sc:vocabItem property is a work of fiction; I’ve not searched around to see if similar things already exist. The idea of linking together medium-sized repositories of metadata is an old one. The Harvest crawling/indexing system did this with RDM/SOIF. In a similar vein was WHOIS++, a now-obsolete directory system that some of us tried using for resource discovery in the late 90s – collections of Web site descriptions federated through having each server syndicate a summary of its content (record types, field names, and tokenized field values). I hadn’t really thought of it this way until I wrote this query, but SparqlPress could in some ways support a reimplementation of the Harvest and WHOIS++ design, if we have gatherer and broker roles attached to a network of blogs. All these ideas are of course close to the P2P scene too: little datasets exchanging summaries of their content to aid query routing. The main point here is that we can run a pretty simple SPARQL query, and get a summary of the contents of various named data graphs; the results of the query in turn give a summary of a database that has copies of these graphs.

Graph URIs in SPARQL: Using UUIDs as named views

I’ve been using the SPARQL query language to access a very ad-hoc collection of personal and social graph data, and thanks to Bengee’s ARC system this can sit inside my otherwise ordinary WordPress installation. At the moment, everything in there is public, but lately I’ve been discussing oauth with a few folk as a way of mediating access to selected subsets of that data. Which means the data store will need some way of categorising the dozens of misc data source URIs. There are a few ways to do this; here I try a slightly non-obvious approach.

Every SPARQL store can have many graphs inside, named by URI, plus optionally a default graph. The way I manage my store is a kind of structured chaos, with files crawled from links in my own data and my friends'. One idea for indicating the structure of this chaos is to keep "table of contents" metadata in the default graph. For example, I might load up <http://danbri.org/foaf.rdf> into a SPARQL graph named with that URI. And I might load up <http://danbri.org/evilfoaf.rdf> into another graph, also using the retrieval URI to identify the data within my SPARQL store. Now, two points to make here: firstly, the SPARQL spec does not mandate that we do things this way. An application that, for example, wanted to keep historical versions of a FOAF or RSS schema document could keep the triples from each version in a different named graph. Perhaps these might be named with UUIDs, for example. The second point is that there are many different "meta" claims we want to store about our datasets, and mixing them all into the store-wide "default graph" could be rather limiting, especially if we mightn't want to unconditionally believe even those claims.

In the example above, I have data from running a PGP check against foaf.rdf (which passed) and against evilfoaf.rdf (which doesn't check out against my PGP identity). Now where do I store the results of this PGP checking? Perhaps the default graph, but maybe that's messy. The idea I'm playing with here is that UUIDs are reasonable identifiers, and that perhaps we'll find ourselves sharing common UUIDs across stores.

Go back to my sent-mail FOAF crawl example from yesterday. How far did I get? Well the end result was a list of URLs which I looped through, and loaded into my big chaotic SPARQL store. If I run the following query, I get a list of all the data graphs loaded:

SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o . } }

This reveals 54 URLs, basically everything I've loaded into ARC in the last month or so. Only 30 of these came from yesterday's hack, which used Google's new Social Graph API to let me map from hashed mailbox IDs to crawlable data URIs. So today's game is to disentangle those 30 from the 54, so they can be superimposed on each other without always being mixed with every other bit of information in the store. In other words, I'm looking for a flexible, query-based way of defining views into my personal data chaos.

So, what I tried. I took the result of yesterday’s hack, a file of data URIs called urls.txt. Then I modified my commandline dataloader script (yeah yeah this should be part of wordpress). My default data loader simply takes each URI, gets the data, and shoves it into the store under a graph name which was the URI used for retrieval. What I did today is, additionally, make a “table of contents” overview graph. To avoid worrying about names, I generated a UUID and used that. So there is a graph called <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> which contains simple asserts of the form:

<http://www.advogato.org/person/benadida/foaf.rdf> a <http://xmlns.com/foaf/0.1/Document> .
<http://www.w3c.es/Personal/Martin/foaf.rdf> a <http://xmlns.com/foaf/0.1/Document> .
<http://www.iandickinson.me.uk/rdf/foaf.rdf> a <http://xmlns.com/foaf/0.1/Document> . # etc

…for each of the 30 files my crawler loaded into the store.
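
For the curious, here is roughly the shape of that loader tweak, sketched in Ruby rather than the actual script: it reads urls.txt and emits the table-of-contents triples as N-Triples, ready to be loaded into the store under the UUID-named graph. The file names and the UUID are just those of this example.

#!/usr/bin/env ruby
# Sketch only, not the real loader: turn a list of crawled data URIs into
# "table of contents" triples typing each one as a foaf:Document.
RDF_TYPE  = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'
FOAF_DOC  = 'http://xmlns.com/foaf/0.1/Document'
TOC_GRAPH = 'uuid:420d9490-d73f-11dc-95ff-0800200c9a66'  # mint a fresh UUID in real use

File.open('toc.nt', 'w') do |out|
  File.readlines('urls.txt').map(&:strip).reject(&:empty?).each do |uri|
    out.puts "<#{uri}> <#{RDF_TYPE}> <#{FOAF_DOC}> ."
  end
end

warn "Now load toc.nt into the store as graph <#{TOC_GRAPH}>, and each URI into a graph named by that URI."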

This lets us use <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> as an indirection point for information related to this little mailbox crawler hack. I don't have to "pollute" the single default graph with this data. And because the uuid: was previously meaningless, it is something we might decide makes sense to use across data visibility boundaries, ie. you might use the same UUID in your own SPARQL store, so we can share queries and app logic.

Here's a simple query. It says, "ask the mailbox crawler table of contents graph (which we call uuid:420d9etc…) for all things it knows about that are a Document". Then it says "ask each of those documents, for everything in it". And then the SELECT clause returns all the property URIs. This gives a first-level overview of what's in the Web of data files found by the crawl. The query was:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?p WHERE {
GRAPH <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> { ?crawled a :Document . }
GRAPH ?crawled { ?s ?p ?o . }
}
ORDER BY ?p

I'll just show the first page full of properties it found; for the rest, see the link to the complete set. Since W3C's official SPARQL doesn't have aggregates, we'd need to write application code (or use something like the SPARQL+ extensions) to get property usage counts. Here are some of the properties that were found in the data:

  http://kota.s12.xrea.com/vocab/uranaibloodtype
  http://purl.org/dc/elements/1.1/creator
  http://purl.org/dc/elements/1.1/description
  http://purl.org/dc/elements/1.1/format
  http://purl.org/dc/elements/1.1/title
  http://purl.org/dc/terms/created
  http://purl.org/dc/terms/modifed
  http://purl.org/dc/terms/modified
  http://purl.org/net/inkel/rdf/schemas/lang/1.1#masters
  http://purl.org/net/inkel/rdf/schemas/lang/1.1#reads
  http://purl.org/net/inkel/rdf/schemas/lang/1.1/masters
  http://purl.org/net/inkel/rdf/schemas/lang/1.1/reads
  http://purl.org/net/inkel/rdf/schemas/lang/1.1/speaks
  http://purl.org/net/schemas/quaffing/drankBeerWith
  http://purl.org/net/schemas/quaffing/drankLagerWith
  http://purl.org/net/vocab/2004/07/visit#caregion
  http://purl.org/net/vocab/2004/07/visit#country
  http://purl.org/net/vocab/2004/07/visit#usstate
  http://purl.org/ontology/mo/hasTrack
  http://purl.org/ontology/mo/myspace
  http://purl.org/ontology/mo/performed

So my little corner of the Web includes properties that extend FOAF documents to include blood types, countries that have been visited, language skills that people have, music information, and even drinking habits. But remember that this comes from my corner of the Web – people I've corresponded with – and probably isn't indicative of the wider network. But that's what grassroots decentralised data is all about. The folk who published this data didn't need to ask permission of any committee to do so; they just mixed in what they wanted to say, alongside more widely used terms like foaf:Person and foaf:name. This is the way it should be: ask forgiveness, not permission, from the language lawyers and standardistas.

Ok, so let’s dig deeper into the messy data I crawled up from my sentmail contacts?

Here’s one that finds some photos, either using FOAF’s :img or :depiction properties:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT * WHERE {
GRAPH <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> { ?crawled a :Document . }
GRAPH ?crawled {
{ ?x :depiction ?y1 } UNION { ?x :img ?y2 } .
}
}

Here’s another that asks the crawl results for names and homepages it found:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT * WHERE {
GRAPH <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> { ?crawled a :Document . }
GRAPH ?crawled { [ :name ?n; :homepage ?h ] }
}

To recap, the key point here is that social data in a SPARQL store will be rather chaotic. Information will often be missing, and often be extended. It will come from a variety of parties, some of whom you trust, some of whom you don’t know, and a few of whom will be actively malicious. Later down the line, subsets of the data will need different permissioning: if I export a family tree from ancestry.co.uk, I don’t want everyone to be able to do a SELECT for mother’s maiden name and my date of birth.

So what I suggest here, is that we can use UUID-named graphs as an organizing structure within an otherwise chaotic SPARQL environment. The demo here shows how one such graph can be used as a “table of contents” for other graphs associated with a particular app — in this case, the Google-mediated sentmail crawling app I made yesterday. Other named views might be: those data files from colleagues, those files that are plausibly PGP-signed, those that contain data structured according to some particular application need (eg. calendar, addressbook, photos, …).

RDF in Ruby revisited

If you’re interested in collaborating on Ruby tools for RDF, please join the public-rdf-ruby@w3.org mailing list at W3C. Just send a note to public-rdf-ruby-request@w3.org with a subject line of “subscribe”.

Last weekend I had the fortune to run into Rich Kilmer at O'Reilly's 'Social Graph Foo Camp' gathering. In addition to helping decorate my tent, Rich told me a bit more about the very impressive Semitar RDF and OWL work he'd done in Ruby, initially as part of the DAML programme. Matt Biddulph was also there, and we discussed again what it would take to include FOAF import into Dopplr. I'd be really happy to see that, both because of Matt's long history of contributions to the Semantic Web scene, and because Dopplr and FOAF share a common purpose. I've long said that a purpose of FOAF is to engineer more coincidences in the world, and Dopplr comes from the same perspective: to increase serendipity.

Now, the thing about engineering serendipity is that it doesn't work without good information flow. And the thing about good information flow is that it benefits from data models that don't assume the world around us comes nicely parceled into cleanly distinct domains. Borrowing from American Splendor – "ordinary life is pretty complex stuff". No single Web site, service, document format or startup is enough; the trick comes when you hook things together in unexpected combinations. And that's just what we did in the RDF world: created a model for mixed up, cross-domain data sharing.

Dopplr, Tripit, Fire Eagle and other travel and location services may know where you and others are. Social network sites (and there are more every day) know something of who you are, and something of who you care about. And the big G in the sky knows something of the parts of this story that are on the public record.

Data will always be spread around. RDF is a handy model for ad-hoc data merging from multiple sources. But you can't do much without an RDF parser and a few other tools. Minimally, an RDF/XML parser and a basic API for navigating the graph. There are many more things you could add. In my old RubyRdf work, I had in-memory and SQL-backed storage, with a Squish query interface to each. I had a donated RDF/XML parser (from Ruby4R) and a much-improved query engine (with support for optionals) from Damian Steer. But the system has code-rotted. I wrote it while learning Ruby, beginning seven years ago, and I think it is "one to throw away". I'm really glad I took the time to declare that project "closed" so as to avoid discouraging others, but it is time to revisit Ruby and RDF again now.

Other tools have other offerings: Dave Beckett’s Redland system (written in C) ships with a Ruby wrapper. Dave’s tools probably have the best RDF parsing facilities around, are fast, but require native code. Rena is a pure Ruby library, which looked like a great start but doesn’t appear to have been developed further in recent years.

I could continue going through the list of libraries, but Paul Stadig has already done a great job of this recently (see also his conclusions, which make perfect sense). There has been a lot of creative work around RDF/RDFS and OWL in Ruby, and collectively we clearly have a lot of talent and code here. But collectively we also lack a finished product. It is a real shame when even an RDF-enthusiast like Matt Biddulph is not in a position to simply “gem install” enough RDF technology to get a simple job done. Let’s get this fixed. As I said above,

If you’re interested in collaborating on Ruby tools for RDF, please join the public-rdf-ruby@w3.org mailing list at W3C. Just send a note to public-rdf-ruby-request@w3.org with a subject line of “subscribe”.

In six months' time, I'd like to see at least one solid, well-rounded and modern RDF toolkit packaged as a Gem for the Ruby community. It should be able to parse RDF/XML flawlessly, and in addition to the usual unit tests, it should be wired up to the RDF Test Cases (see download) so we can all be assured it is robust. It should allow for a fast C parser such as Raptor to be used if available, falling back on pure Ruby otherwise. There should be a basic API that allows me to navigate an RDF graph of properties and values using clear, idiomatic Ruby. Where available, it should hook up to external stores of data, and include at least a SPARQL protocol client, eventually a full SPARQL implementation. It should allow multiple graphs to be super-imposed and disentangled. Some support for RDFS, OWL and rule languages would be a big plus. Support for other notations such as Turtle, RDFa, or XSLT-based GRDDL transforms would be useful, as would a plugin for microformat import. Transliterating Python code (such as the tiny Euler rule engine) should be considered. Divergence from existing APIs in Python (and Perl, Javascript, PHP etc) should be minimised, and carefully balanced against the pull of the Ruby way. And (though I lack strong views here) it should be made available under a liberal opensource license that permits redistribution under GPL. It should also be as I18N and Unicode-friendly as is possible in Ruby these days.
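
To make the "clear, idiomatic Ruby" point a bit less abstract, here's a toy in-memory graph with the kind of navigation feel I have in mind. It is not a proposal for the gem's actual API, just an illustration; the class and method names are invented.

# Toy sketch only: an in-memory triple collection with a minimal navigation API.
class TinyGraph
  Statement = Struct.new(:subject, :predicate, :object)

  def initialize
    @statements = []
  end

  # add a triple; returns self so calls can be chained
  def add(s, p, o)
    @statements << Statement.new(s, p, o)
    self
  end

  # every object value for a given subject/predicate pair
  def values(subject, predicate)
    @statements.select { |st| st.subject == subject && st.predicate == predicate }.map(&:object)
  end

  def each_statement(&block)
    @statements.each(&block)
  end
end

FOAF = 'http://xmlns.com/foaf/0.1/'
g = TinyGraph.new
g.add('http://danbri.org/foaf.rdf#danbri', "#{FOAF}name", 'Dan Brickley')
g.add('http://danbri.org/foaf.rdf#danbri', "#{FOAF}homepage", 'http://danbri.org/')

puts g.values('http://danbri.org/foaf.rdf#danbri', "#{FOAF}name").first   # => "Dan Brickley"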

I'm not saying that all RDF toolkits should be merged, or that collaboration is compulsory. But we are perilously fragmented right now, and collaboration can be fun. In six months' time, people who simply want to use RDF from Ruby ought to be pleasantly surprised rather than frustrated when they take to the 'net to see what's out there. If it takes a year instead of six months, sure, whatever. But not seven years! It would be great to see some movement again towards a common library…

How hard can it be?

Waving not Drowning? groups as buddylist filters

I’ve lately started writing up and prototyping around a use-case for the “Group” construct in FOAF and for medium-sized, partially private data aggregators like SparqlPress. I think we can do something interesting to deal with the social pressure and information load people are experiencing on sites like Flickr and Twitter.

Often people have rather large lists of friends, contacts or buddies – publicly visible lists – which play a variety of roles in their online life. Sometimes these roles are in tension. Flickr, for example, allow their users to mark some of their contacts as "friend" or "family" (or both). Real life isn't so simple, however. And the fact that this classification is shared (in Flickr's case with everyone) makes for a potentially awkward dynamic. Do you really want to tell everyone you know whether they are a full "friend" or a mere "contact"? Let alone keep this information up to date on every social site that hosts it. I've heard a good few folk complain about the stress of adding yet another entry to their list of "twitter follows" people. Their lists are often already huge through some sense of social obligation to reciprocate. I think it's worth exploring some alternative filtering and grouping mechanisms.

On the one hand, we have people “bookmarking” people they find interesting, or want to stay in touch with, or “get to know better”. On the other, we have the bookmarked party sometimes reciprocating those actions because they feel it polite; a situation complicated by crude categories like “friend” versus “contact”. What makes this a particularly troublesome combination is when user-facing features, such as “updates from your buddies” or “photos from your friends/contacts” are built on top of these buddylists.

Take my own case on Flickr. I'm probably not typical, but I'm a use case I care about. I have made no real consistent use of the "friend" versus "contact" distinction; to even attempt to do so would be massively time consuming, and also a bit silly. There are all kinds of great people on my Flickr contacts list. Some I know really well, some I barely know but admire their photography, etc. It seems my Flickr account currently has 7 "family", 79 "friends" and 604 "contacts".

Now to be clear, I’m not grumbling about Flickr. Flickr is a work of genius, no nitpicking. What I ask for is something that goes beyond what Flickr alone can do. I’d like a better way of seeing updates from friends and contacts. This is just a specific case of a general thing (eg. RSS/Atom feed management), but let’s stay specific for now.

Currently, my Flickr email notification settings are:

  • When people comment on your photos: Yes
  • When your friends and family upload new photos: Yes (daily digest)
  • When your other contacts upload new photos: No

What this means is that I chose to see notifications of photos from those 86 people whom I have flagged as "friend" or "family", and chose not to be notified of new photos from the other 604 contacts, even though that list contains many people I know and would like to know better. The usability question here is: how can we offer more subtlety in this mechanism, without overwhelming users with detail? My guess is that we approach this by collecting, in a more personal environment, some information that users might not want to state publicly in an online profile. So a desktop (or weblog, e.g. SparqlPress) aggregation of information from addressbooks, email send/receive patterns, weblog commenting behaviours, machine readable resumes, … real evidence of real connections between real people.

And so I’ve been mining around for sources of “foaf:Group” data. As a work in progress testbed I have a list of OpenIDs of people whose comments I’ve approved in my blog. And another, of people who have worked on the FOAF wiki. I’ve been looking at the machine readable data from W3C’s Tech Reports digital archive too, as that provides information about a network of collaboration going back to the early days of the Web (and it’s available in RDF). I also want to integrate my sent-mail logs, to generate another group, and extract more still from my addressbook. On top of this of course I have access to the usual pile of FOAF and XFN, scraped or API-extracted information from social network sites and IM accounts. Rather than publish it all for the world to see, the idea here is to use it to generate simple personal-use user interfaces couched in terms of these groups. So, hopefully I shall be able to use those groups as a filter against the 600+ members of my flickr buddylist, and make some kind of basic RSS-filtered view of their photos.
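
A rough sketch of the kind of query that filter might boil down to, assuming the group data and the crawled photo/profile data sit in named graphs (the group graph URI and the choice of foaf:depiction are placeholders for whatever the real data uses):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?person ?photo
WHERE {
GRAPH <http://example.org/groups/blog-commenters.rdf> { ?g a foaf:Group; foaf:member ?person . }
GRAPH ?src { ?person foaf:depiction ?photo . }
}

This glosses over the identity-smushing needed to match the same person across graphs, but it shows the shape of the thing: group membership from one source, content from another, joined in the store.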

If this succeeds, it may help people to have huge buddylists without feeling they've betrayed their "real friends" by doing so, or forcing them to put their friends into crudely labelled public buckets marked "real friend" and "mere contact". The core FOAF design principle here is our longstanding bias towards evidence-based rather than proclaimed information. Don't just ask people to tell you who their friends are, and how close the relationship is (especially in public, and most especially in a format that can be imported into spreadsheets for analysis). Instead, take the approach of "don't say it, show it": create tools based on the evidence that friendship and collaboration leaves in the Web. The public Web, but also the bits that don't show up in Google, and hopefully never will. If we can find a way to combine all that behind a simple UI, it might go a little way towards "waving not drowning" in information.

Much of this is stating the obvious, but I thought I’d leave the nerdly details of SPARQL and FOAF/RDF for another post.

Instead here’s a pointer to a Scoble article on Mr Facebook, where they’re grappling with similar issues but from a massive aggregation rather than decentralised perspective:

Facebook has a limitation on the number of friends a person can have, which is 4,999 friends. He says that was due to scaling/technical issues and they are working on getting rid of that limitation. Partly to make it possible to have celebrities on Facebook but also to make it possible to have more grouping kinds of features. He told me many members were getting thousands of friends and that those users are also asking for many more features to put their friends into different groups. He expects to see improvements in that area this year.

ps. anyone know if Facebook group membership data is accessible through their APIs? I know I can get group data from LiveJournal and from My Opera in FOAF, but it would be cool to mix in Facebook too.

Embedding queries in RDF – FOAF Group example

Is this crazy or useful? Am not sure yet.

This example uses FOAF vocabulary for groups and openid. So the basic structure here is that Agents (including persons) can have an :openid and can be a :member of a :Group.

From an openid-augmented WordPress, we get a list of all the openids my blog knows about. From an openid-augmented MediaWiki, we get a list of all the openids that contribute to the FOAF project wiki. I dumped each into a basic RDF file (not currently an automated process). But the point here is to explore enumerated groups using queries.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://xmlns.com/foaf/0.1/">
<Group rdf:about='#both'>
<!-- enumerated membership -->
<member><Agent><openid rdf:resource='http://danbri.org/'/></Agent></member>
<member><Agent><openid rdf:resource='http://tommorris.org/'/></Agent></member>
<member><Agent><openid rdf:resource='http://kidehen.idehen.net/dataspace/person/kidehen'/></Agent></member>
<member><Agent><openid rdf:resource='http://www.wasab.dk/morten/'/></Agent></member>
<member><Agent><openid rdf:resource='http://kronkltd.net/'/></Agent></member>
<member><Agent><openid rdf:resource='http://www.kanzaki.com/'/></Agent></member>

<!-- rule-based membership -->

<constructor><![CDATA[
PREFIX : <http://xmlns.com/foaf/0.1/>
CONSTRUCT {
<http://danbri.org/yasns/danbri/both.rdf#thegroup> a :Group; :member [ a :Agent; :openid ?id ]
}
WHERE {
GRAPH <http://wiki.foaf-project.org/_users.rdf> { [ a :Group; :member [ a :Agent; :openid ?id ] ] }
GRAPH <http://danbri.org/yasns/danbri/_group.rdf> { [ a :Group; :member [ a :Agent; :openid ?id ] ] }
}
]]></constructor>
</Group>
</rdf:RDF>

This RDF description does it both ways. It enumerates (for simple clients) a list of members of a group whose members are those individuals that are both commentators on my blog, and contributors to the FOAF wiki. At least, to the extent they’re detectable via common use of OpenID URIs. But the RDF group description also embeds a SPARQL query, the kind which generates RDF rather than an SQL-like resultset. The RDF essentially regenerates the enumerated list, assuming the query is run against an RDF dataset with the data graphs appropriately populated.

Now I sorta like this, and I sorta don't. It may be incredibly powerful, or it may be a bit too clever for its own good.

Certainly there's scope overlap with the W3C RIF rules work, and with the capabilities of OWL. FOAF has long contained an experimental method for using OWL to do something similar, but it hasn't found traction. The motivation for trying SPARQL here is that it has built-in machinery for talking about the provenance of data; so I could write a group description this way that says "members are anyone listed as a colleague in http://myworkplace.example.com/stafflist.rdf". Or I could mix in arbitrary descriptive vocabularies; family tree stuff, XFN, language abilities (speaks-reads-writes) etc.
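
For instance, the stafflist case might look much like the constructor above but drawing on a single source graph. The graph URI is the example one from the previous sentence, and the assumption is simply that the staff list uses the same Group/member/openid idiom:

PREFIX : <http://xmlns.com/foaf/0.1/>
CONSTRUCT {
<#colleagues> a :Group; :member [ a :Agent; :openid ?id ]
}
WHERE {
GRAPH <http://myworkplace.example.com/stafflist.rdf> { [ a :Group; :member [ a :Agent; :openid ?id ] ] }
}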

Where I think this could fall down is in the complexity of the workflow. The queries need executing against some SPARQL installation with a configured dataset, and the query lists URIs of data graphs. But I doubt database admins will want to randomly load any/every RDF file mentioned in these shared queries. Perhaps something like SparqlPress, attached to one's weblog, with social filters so that only files mentioned in queries from friends get loaded? Also, authoring these kinds of query isn't something non-geek users are going to do often, and the sorts of queries that will work will depend of course on the data actually available. Sure, I could write a query based on matching the openids of former colleagues, but the group will be empty unless the data listing people as former colleagues is actually out there on the Web, and written in the terms anticipated by the query.

On the other hand, this mechanism does appeal, and could go way beyond FOAF group definitions. We could see a model where people post data in the Web but also post queries, eg. revisiting the old work Libby and I explored around RSS query. On the other other hand, who wants to make their Web queries public? All that said, the same goes for the data being queried. And since this technique embeds queries inside ordinary RDF data, however we deal with the data visibility issue for RDF/FOAF should also work for the query stuff. Perhaps. Can’t blame me for trying…
I realise this isn’t the clearest of explanations. Let’s try again:

RDF is normally for publishing collections of simple claims about the world. This is an experiment in embedding data-generating-queries amongst these claims, where the query is configured to output more RDF claims (aka statements, triples etc), but only when executed against some appropriate body of RDF data. Since the query is written in SPARQL, it allows the data-generation rules to mention interesting things, such as properties of the source of the data being queried.

This particular experiment is couched in terms of FOAF's "Group" construct, but the technique is entirely general. The example above defines a group of agents called the "both" group, by saying that an Agent is in that group if its OpenID URI is listed in each of the two RDF documents specified, ie. if it is both a commentator on my blog and a contributor to the FOAF Wiki. Other examples could be "(fe)male employees" or "family members sharing a blood type" or, in fact, any descriptive pattern that can match against the data to hand and be expressed in SPARQL.

Ruby client for querying SPARQL REST services

I’ve started a Ruby conversion of Ivan Herman’s Python SPARQL client, itself inspired by Lee Feigenbaum’s Javascript library. These are tools which simply transmit a SPARQL query across the ‘net to a SPARQL-protocol database endpoint, and handle the unpacking of the results. These queries can result in yes/no responses, variable-to-value bindings (rather like in SQL), or in chunks of RDF data. The default resultset notation is a simple XML format; JSON results are also widely available.

All I've done so far is to sit down with the core Python file from Ivan's package, and slog through converting it brainlessly into rather similar Ruby. Take out "self" and ":" from method definitions, write "nil" instead of "None", write "end" at the end of each method and class, use Ruby's iteration idioms, and you're most of the way there. Of course the fiddly detail is where we use external libraries: for URIs, network access, JSON and XML parsing. I've cobbled something together which could be the basis for reconstructing all the original functionality in Ivan's code. Currently it's more proof of concept, but enough to be worth posting.
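
Stripped of all the library plumbing, the core of what such a client does is small. Here's a bare-bones, standalone illustration (not the package itself; the endpoint URL is a placeholder) that sends a query with a GET and unpacks the JSON bindings:

require 'net/http'
require 'uri'
require 'json'

endpoint = URI('http://example.org/sparql')   # placeholder endpoint
query    = 'SELECT ?name WHERE { ?x <http://xmlns.com/foaf/0.1/name> ?name } LIMIT 10'

endpoint.query = URI.encode_www_form('query' => query)
req = Net::HTTP::Get.new(endpoint.request_uri)
req['Accept'] = 'application/sparql-results+json'   # ask for the JSON resultset

res = Net::HTTP.start(endpoint.host, endpoint.port) { |http| http.request(req) }
results = JSON.parse(res.body)

# The JSON format mirrors the XML one: head/vars plus a results/bindings array.
results['results']['bindings'].each do |row|
  puts row['name']['value']
end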

Files are in SVN and browsable online; there’s also a bundled up download for the curious, brave or helpful. The test script shows all we have at the moment: ability to choose XML or JSON results, and access result set binding rows. Other result formats (notably RDF; there’s no RDF/XML parser currently) aren’t handled, and various bits of the code conversion are incomplete. I’d be super happy if someone came along and helped finish this!

Update: Ruby's REXML parser is now clumsily wired in; you can get a REXML Document object (see xml.com writeup) as a way of navigating the resultset.

SPARQL results in spreadsheets

I made a little progress with SPARQL and spreadsheets. On Kingsley Idehen's advice, I revisited OpenOffice (NeoOffice on MacOSX) and used the HTML table import utility.

It turns out that if I have an HTTP URL for a GET query to ARC‘s SPARQL endpoint, and I pass in the non-standard format=htmltab parameter, I get an HTML table which can directly be understood by NeoOffice. I’ll write this up properly another time, but in brief, I hacked the HTML form generator for the endpoint to use GET instead of POST, making it easier to get URLs associated with SPARQL queries. Then in NeoOffice, under “Insert” / “Link to External Data…”, paste in that URL, hit return, select the appropriate HTML table … and the data should import.
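
So the URL you paste into NeoOffice ends up looking something like this (the endpoint path here is a placeholder; the query string is just a URL-encoded SPARQL query, in this case the "list all graphs" query from earlier, plus ARC's format parameter):

http://example.org/sparql?query=SELECT+DISTINCT+%3Fg+WHERE+%7B+GRAPH+%3Fg+%7B+%3Fs+%3Fp+%3Fo+%7D+%7D&format=htmltab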

Here’s some fragments of FOAF and related info in NeoOffice (from a query that asked for the data source URI, a name and a homepage URI):

Open your data?

I tried this on a version of Excel for MacOSX (an old one, 2001), but failed to get HTML import working. Instead I just copied the info over from NeoOffice, since I wanted to try learning Excel's Pivot and charting tools. Here is a quick example derived from an Excel analysis of the RDF properties that appear in the different graphs in my SPARQL store. The imported data was just two columns: a GRAPH URI, ?g, and a property URI, ?property:


PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?g ?property
WHERE {
GRAPH ?g { ?thing ?property ?value . }
}
ORDER BY ?property

Excel (and NeoOffice) offer Pivot and charting tools (a simplified version of what you'll find under the Business Intelligence banner elsewhere) that take simple flat tables like this and (as they say) slice, dice, count and summarise. Here's an unreadable chart I managed to make it generate:

SPARQL stats

Am still learning the ropes here… but I like the idea of combining desktop data analysis and reports with an ARC SPARQL store attached to my blog, and populated with data crawled from my web2ish ‘network neighbourhood’ (ie. FOAF, RSS, XFN, SIOC etc). Work in spare-time progress.

SQL, OpenOffice: would a JDBC driver for SPARQL protocol make sense?

I'd like to have access to my SPARQL data from desktop tools like OpenOffice's spreadsheet. Apparently it can talk to remote databases via JDBC and ODBC, both for the spreadsheet and database tools within OpenOffice.

Many years ago, pre-SPARQL, I hassled Libby into making her Squish RDF query system talk JDBC. There have been related efforts since, but I've lost track of what's currently feasible with modern tools. There was a SPARQL4J collaboration initially between Asemantics, Profium and HP Labs, and some JDBC code from 2005 is on sourceforge. SPARQL inside JDBC is always going to be a bit of a hack, I suspect, since SPARQL results aren't exactly like SQL results. But if it's the easiest way to get things into a spreadsheet, it's probably worth pursuing. I suspect this is the sort of problem OpenLink are good at, given their database driver background.

Would it be possible to wire up OpenOffice somehow so that it thought it was talking JDBC, but the connection parameters told the Java driver about a SPARQL endpoint URL? Actually all this is a bit predicated on the assumption that OpenOffice allows the app user to pass in SQL, so we could pass SPARQL instead. If all the SQL is generated by OpenOffice, we'd need to hack the OO source. Am I making any sense? What still needs to be done? Where are we today?