Graph URIs in SPARQL: Using UUIDs as named views

I’ve been using the SPARQL query language to access a very ad-hoc collection of personal and social graph data, and thanks to Bengee’s ARC system this can sit inside my otherwise ordinary WordPress installation. At the moment, everything in there is public, but lately I’ve been discussing OAuth with a few folk as a way of mediating access to selected subsets of that data. Which means the data store will need some way of categorising the dozens of misc data source URIs. There are a few ways to do this; here I try a slightly non-obvious approach.

Every SPARQL store can have many graphs inside, named by URI, plus optionally a default graph. The way I manage my store is a kind of structured chaos, with files crawled from links in my own data and my friends’. One idea for indicating the structure of this chaos is to keep “table of contents” metadata in the default graph. For example, I might load up <http://danbri.org/foaf.rdf> into a SPARQL graph named with that URI. And I might load up <http://danbri.org/evilfoaf.rdf> into another graph, also using the retrieval URI to identify the data within my SPARQL store. Now, two points to make here. Firstly, the SPARQL spec does not mandate that we do things this way. An application that wanted, for example, to keep historical versions of a FOAF, RSS or schema document could keep the triples from each version in a different named graph. Perhaps these might be named with UUIDs. The second point is that there are many different “meta” claims we want to store about our datasets, and mixing them all into the store-wide “default graph” could be rather limiting, especially if we mightn’t want to unconditionally believe even those claims.

In the example above, I have data from running a PGP check against foaf.rdf (which passed) and evilfoaf.rdf (which doesn’t pass a check against my PGP identity). Now where do I store the results of this PGP checking? Perhaps the default graph, but maybe that’s messy. The idea I’m playing with here is that UUIDs are reasonable identifiers for graphs holding this kind of metadata, and that perhaps we’ll find ourselves sharing common UUIDs across stores.

Go back to my sent-mail FOAF crawl example from yesterday. How far did I get? Well, the end result was a list of URLs which I looped through and loaded into my big chaotic SPARQL store. If I run the following query, I get a list of all the data graphs loaded:

SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o . } }

This reveals 54 URLs, basically everything I’ve loaded into ARC in the last month or so. Only 30 of these came from yesterday’s hack, which used Google’s new Social Graph API to allow me to map from hashed mailbox IDs to crawlable data URIs. So today’s game is to help me disentangle the 30 from the 54, and superimpose them on each other, but not always mixed with every other bit of information in the store. In other words, I’m looking for a flexible, query-based way of defining views into my personal data chaos.

So, here’s what I tried. I took the result of yesterday’s hack, a file of data URIs called urls.txt. Then I modified my commandline dataloader script (yeah yeah this should be part of WordPress). My default data loader simply takes each URI, gets the data, and shoves it into the store under a graph name which is the URI used for retrieval. What I did today is, additionally, make a “table of contents” overview graph. To avoid worrying about names, I generated a UUID and used that. So there is a graph called <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> which contains simple assertions of the form:

<http://www.advogato.org/person/benadida/foaf.rdf> a <http://xmlns.com/foaf/0.1/Document> .
<http://www.w3c.es/Personal/Martin/foaf.rdf> a <http://xmlns.com/foaf/0.1/Document> .
<http://www.iandickinson.me.uk/rdf/foaf.rdf> a <http://xmlns.com/foaf/0.1/Document> . # etc

…for each of the 30 files my crawler loaded into the store.
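To make that loader step concrete, here is a minimal sketch in Python. It is not the script used for this post: the function names (`new_view_graph_uri`, `toc_triples`) are made up for illustration, and how the resulting triples get loaded into the store under the UUID graph name is left to whatever the store provides.

```python
import uuid

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_DOCUMENT = "http://xmlns.com/foaf/0.1/Document"


def new_view_graph_uri():
    """Mint a fresh uuid: URI to name a table-of-contents graph."""
    # uuid1() is time-based; uuid4() (random) would serve equally well here.
    return "uuid:%s" % uuid.uuid1()


def toc_triples(urls):
    """Return one N-Triples line per crawled URL, typing it as a foaf:Document."""
    lines = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines, e.g. a trailing newline in urls.txt
        lines.append("<%s> <%s> <%s> ." % (url, RDF_TYPE, FOAF_DOCUMENT))
    return lines
```

Feeding the lines of urls.txt to `toc_triples` gives the body of the overview graph, and `new_view_graph_uri()` gives it a name. The `a` in the examples above is just shorthand for the full rdf:type URI spelled out here.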

This lets us use <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> as an indirection point for information related to this little mailbox crawler hack. I don’t have to “pollute” the single default graph with this data. And because the uuid: was previously meaningless, it is something we might decide makes sense to use across data visibility boundaries, i.e. you might use the same UUID in your own SPARQL store, so we can share queries and app logic.

Here’s a simple query. It says, “ask the mailbox crawler table of contents graph (which we call uuid:420d9etc…) for all things it knows about that are a Document”. Then it says “ask each of those documents for everything in it”. And then the SELECT clause returns all the property URIs. This gives a first level overview of what’s in the Web of data files found by the crawl. Query was:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?p WHERE {
  GRAPH <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> { ?crawled a :Document . }
  GRAPH ?crawled { ?s ?p ?o . }
}
ORDER BY ?p

I’ll just show the first page full of properties it found; for the rest see link to the complete set. Since W3C’s official SPARQL doesn’t have aggregates, we’d need to write application code (or use something like the SPARQL+ extensions) to get property usage counts. Here are some of the properties that were found in the data:

http://kota.s12.xrea.com/vocab/uranaibloodtype
http://purl.org/dc/elements/1.1/creator
http://purl.org/dc/elements/1.1/description
http://purl.org/dc/elements/1.1/format
http://purl.org/dc/elements/1.1/title
http://purl.org/dc/terms/created
http://purl.org/dc/terms/modifed
http://purl.org/dc/terms/modified
http://purl.org/net/inkel/rdf/schemas/lang/1.1#masters
http://purl.org/net/inkel/rdf/schemas/lang/1.1#reads
http://purl.org/net/inkel/rdf/schemas/lang/1.1/masters
http://purl.org/net/inkel/rdf/schemas/lang/1.1/reads
http://purl.org/net/inkel/rdf/schemas/lang/1.1/speaks
http://purl.org/net/schemas/quaffing/drankBeerWith
http://purl.org/net/schemas/quaffing/drankLagerWith
http://purl.org/net/vocab/2004/07/visit#caregion
http://purl.org/net/vocab/2004/07/visit#country
http://purl.org/net/vocab/2004/07/visit#usstate
http://purl.org/ontology/mo/hasTrack
http://purl.org/ontology/mo/myspace
http://purl.org/ontology/mo/performed
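As for the property usage counts that would need application code: one way to get them is to run the same query without DISTINCT and tally the ?p values client-side. The sketch below assumes the result rows have already been fetched and turned into dicts with a "p" key; that row shape and the function names are assumptions for illustration, not the format of any particular SPARQL client.

```python
from collections import Counter


def property_usage(rows):
    """Count how often each property URI appears in non-DISTINCT results.

    `rows` is assumed to be an iterable of dicts, one per result row,
    each with a "p" key holding the property URI as a string.
    """
    return Counter(row["p"] for row in rows)


def top_properties(rows, n=10):
    """Return the n most used properties as (uri, count) pairs."""
    # most_common() sorts by count, highest first
    return property_usage(rows).most_common(n)
```

This trades a larger result set for not depending on any store-specific extension.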

So my little corner of the Web includes properties that extend FOAF documents to include blood types, countries that have been visited, language skills that people have, music information, and even drinking habits. But remember that this comes from my corner of the Web – people I’ve corresponded with – and probably isn’t indicative of the wider network. But that’s what grassroots decentralised data is all about. The folk who published this data didn’t need to ask permission of any committee to do so, they just mixed in what they wanted to say, alongside terms more widely used like foaf:Person, foaf:name. This is the way it should be: ask forgiveness, not permission, from the language lawyers and standardistas.

Ok, so let’s dig deeper into the messy data I crawled up from my sentmail contacts.

Here’s one that finds some photos, either using FOAF’s :img or :depiction properties:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT * WHERE {
GRAPH <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> { ?crawled a :Document . }
GRAPH ?crawled {
{ ?x :depiction ?y1 } UNION { ?x :img ?y2 } .
}
}

Here’s another that asks the crawl results for names and homepages it found:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT * WHERE {
  GRAPH <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> { ?crawled a :Document . }
  GRAPH ?crawled { { [ :name ?n; :homepage ?h ] } }
}

To recap, the key point here is that social data in a SPARQL store will be rather chaotic. Information will often be missing, and often be extended. It will come from a variety of parties, some of whom you trust, some of whom you don’t know, and a few of whom will be actively malicious. Later down the line, subsets of the data will need different permissioning: if I export a family tree from ancestry.co.uk, I don’t want everyone to be able to do a SELECT for mother’s maiden name and my date of birth.

So what I suggest here is that we can use UUID-named graphs as an organizing structure within an otherwise chaotic SPARQL environment. The demo here shows how one such graph can be used as a “table of contents” for other graphs associated with a particular app, in this case the Google-mediated sentmail crawling app I made yesterday. Other named views might be: those data files from colleagues, those files that are plausibly PGP-signed, those that contain data structured according to some particular application need (e.g. calendar, addressbook, photos, …).
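Since every query in this post has the same two-GRAPH shape, app logic shared across stores could build them from a view's UUID and a triple pattern. A rough sketch, with `view_query` being a hypothetical helper rather than anything provided by ARC or SparqlPress:

```python
FOAF = "http://xmlns.com/foaf/0.1/"


def view_query(view_uri, pattern, select="DISTINCT *"):
    """Build a SPARQL query that runs `pattern` only in graphs listed by a view.

    `view_uri` names the table-of-contents graph; `pattern` is a SPARQL
    graph pattern (trusted input) using the default FOAF prefix.
    """
    # Refuse characters that could break out of the <...> IRI syntax.
    if any(ch in view_uri for ch in '<>"{}|^`\\ \n\t'):
        raise ValueError("not a usable graph URI: %r" % view_uri)
    return (
        "PREFIX : <%s>\n"
        "SELECT %s WHERE {\n"
        "  GRAPH <%s> { ?crawled a :Document . }\n"
        "  GRAPH ?crawled { %s }\n"
        "}" % (FOAF, select, view_uri, pattern)
    )
```

Calling `view_query("uuid:420d9490-d73f-11dc-95ff-0800200c9a66", "?x :img ?y .")` gives a query of the same shape as the photo query above; swapping in the UUID of a "plausibly PGP-signed" view would restrict the same pattern to that subset instead.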

One Response to Graph URIs in SPARQL: Using UUIDs as named views

  1. Why use UUIDs and not HTTP URIs?

    Regarding the indirection and named graph usage, this is what the ScutterVocab does (also in SparqlPress).

    The actual TOC stuff is a little more difficult, mostly with regards to duplicates, but I don’t see any problems in grouping (!) documents by “source”.
