SPARQL


Via momolondon list: opencellid.org data dumps

The readme.txt file describes the tabular data structure (split into a cells, and a measures file).

I think the cells data is the one most folk will be interested in re-using. Table headings are:

# id,lat,lon,mcc,mnc,lac,cellid,range,nbSamples,created_at,updated_at
For example:
7,44.8802,-0.526878,208,10,18122,32951790,0,2,2008-03-31 15:22:22,2008-04-07 08:57:33

This could be RDFized using something similar to the (802.11-centric) Wireless Ontology. Perhaps even using lqraps

More great music-related stuff from Yves Raimond. He’s just announced (on the Music ontology list) a D2RQ mapping of the MusicBrainz SQL into RDF and SPARQL. There’s a running instance of it on his site. The N3 mapping files are on the  motools sourceforge site.

Yves writes…

Added to the things that are available within the Zitgist mapping:

  •  SPARQL end point
  •  Support for tags
  • Supports a couple of advanced relationships (still working my way  through it, though)
  • Instrument taxonomy directly generated from the db, and related to performance events
  • Support for orchestras

This is pretty cool, since the original MusicBrainz RDF is rather dated (if it’s even still available). The new representations are much richer and probably also easier to maintain.

Nearby in the Web: discussion of RDF views into MySpace; and the RDB2RDF Incubator Group at W3C discussions are getting started (this group is looking at technology such as D2RQ which map non-RDF databases into our strange parallel world…)

Speaks, reads, writes
Stephanie Booth asks:

 I vaguely remember somebody telling me about some emerging “standard” (too big a word) for encoding language skills. Or was it a dream?

That would’ve been me, showing markup from the FOAFX beta from Paola Di Maio and friends, which explores the extension of FOAF with expertise information. This is part of the ExpertFinder discussions alongside the FOAF project (see also wiki, mailing list). FOAFX and the ExpertFinder community are looking at ways of extending FOAF to better describe people’s expertise; both self-described and externally accredited. This is at once a fascinating, important and terrifyingly hard to scope problem area. It touches on longstanding “Web of trust” themes, on educational metadata standards, and on the various ways of characterising topics or domains of expertise. In other words, in any such problem space, there will always be multiple ways of “doing it”. For example, here is how the Advogato community site characterises my expertise regarding opensource software: foaf.rdf (I’m in the Journeyer group, apparently; some weighted average of people’s judgements about me).

One thing FOAFX attempts is to describe language skills. For this, they extend the idiom proposed by Inkel some years ago in his “Speaks, Reads, Writesschema. In the original (which is Spanish, but see also English version), the classification was effectively binary: one could either speak, read, or write a language; or one couldn’t. You could also say you ‘mastered’ it, meaning that you could speak, read and write it. In FOAFX, this is handled differently: we get a 1-5 score. I like this direction, as it allows me to express that I have some basic capability in Spanish, without appearing to boast that I’m anything like “fluent”. But … am I a “1″ or a “2″? Should I poll my long-suffering Spanish-speaking friends? Take an online quiz? Introducing numbers gives the impression of mathematical precision, but in skill characterisation this is notoriously hard (and not without controversy).

My take here is that there’s no right thing to do. So progress and experimentation are to be celebrated, even if the solution isn’t perfect. On language skills, I’d love some way also to allow people to say “I’m learning language X”, or “I’m happy to help you practice your English/Spanish/Japanese/etc.”. Who knows, with more such information available, online Social Network sites could even prove useful…

Here btw is the current RDF markup generated by FOAFX:

<foaf:Person rdf:ID="me">
<foaf:mbox_sha1>6e80d02de4cb3376605a34976e31188bb16180d0</foaf:mbox_sha1>
<foaf:givenname>Dan</foaf:givenname>
<foaf:family_name>Brickley</foaf:family_name>
<foaf:homepage rdf:resource="http://danbri.org/" />
<foaf:weblog rdf:resource="http://danbri.org/words/" />
<foaf:depiction rdf:resource="http://danbri.org/images/me.jpg" />
<foaf:jabberID>danbrickley@gmail.com</foaf:jabberID>
<foafx:language>
<foafx:Language>
<foafx:name>English</foafx:name>
<foafx:speaking>5</foafx:speaking>
<foafx:reading>5</foafx:reading>
<foafx:writing>5</foafx:writing>
</foafx:Language>
</foafx:language>
<foafx:language>
<foafx:Language>
<foafx:name>Spanish</foafx:name>
<foafx:speaking>1</foafx:speaking>
<foafx:reading>1</foafx:reading>
<foafx:writing>1</foafx:writing>
</foafx:Language>
</foafx:language>
<foafx:expertise>
<foafx:Expertise>
<foafx:field>::</foafx:field>
<foafx:fluency>
<foafx:Language>
<foafx:name>English</foafx:name>
</foafx:Language>
</foafx:fluency>
</foafx:Expertise>
</foafx:expertise>
</foaf:Person>

The apparent redundancy in the markup (expertise, Expertise) is due to RDF’s so-called “striped” syntax. I have an old introduction to this idea; in short, RDF lets you define properties of things, and categories of thing. The FOAFX design effectively says, “there is a property of a person called “expertise” which relates that person to another thing, an “Expertise”, which itself has properties like “fluency”.

The FOAFX design tries to navigate between generic and specific, by including language-oriented markup as well as more generic skill descriptions. I think this is probably the right way to go. There are many things that we can say about human languages that don’t apply to other areas of expertise (eg. opensource software development). And there many things we can say about expertise in general (like expressions of willingness to learn, to teach, … indications of formal qualification) which are cross domain. Similarly, there are many things we might say in markup about opensource projects (picking up on my Advogato mention earlier) which have nothing to do with human languages. Yet both human language expertise and opensource skills are things we might want to express via FOAF extensions. For example, the DOAP project already lets us describe opensource projects and our roles in them.

The Semantic Web design challenge here is to provide a melting pot for all these different kinds of data, one that allows each specific problem to be solved adequately in a reasonable time-frame, without precluding the possibility for richer integration at a later date. I have a hunch that the Advogato design, which expresses skills in terms of group membership, could be a way to go here.

This is related to the idea of expressing group-membership criteria through writing SPARQL queries. For example, we can talk about the Group of people who work for W3C. Or we can talk about the Group of people who work for W3C as listed authoritatively on the W3C site. Both rules are expressible as queries; the latter a query that says things about the source of claims, as well as about what those claims assert. This notion of a group defined by a query allows for both flavours; the definition could include criteria relating to the provenance (ie. source) of the claims, but it needn’t. So we could express the idea of people who speak Spanish, or the idea of people who speak french according to having passed some particular test, or being certified by some agency. In either case, the unifying notion is “person X is in group Y”, where Y is a group identified by some URL. What I like about this model, is it allows for a very loose division of labour: skill-related markup is necessarily going to be widely varied. Yet the idea that such scattered evidence boils down to people falling into definable groups, gives some overall cohesion to this diversity. I could for example run a query asking for people with (foafx idiom) “Spanish skills of 2 or more”. I could add a constraint that the person be at least a “Journeyer” regarding their opensource skills, according to Advogato, or perhaps mix in data expressed in DOAP terms regarding their roles in opensource project work. These skills effectively define groups (loosly, sets) of people, and skill search can be pictured in venn diagram terms. Of course all this depends on getting enough data out there for any such queries to be worthwhile. Maybe a Facebook app that re-published data outside of Hotel Facebook would be a way of bootstrapping things here?

I’ve just published a quick writeup (with running toy Ruby code) of a “reverse SPARQL” utility called lqraps, a tool for re-constructing RDF from tabular data.

The idea is that such a tool is passed a tab-separated (eventually, CSV etc.) file, such as might conventionally be loaded into Access, spreadsheets etc. The tool looks for a few lines of special annotation in “#”-comments at the top of the file, and uses these to (re)generate RDF. My simple implementation does this with text-munging of SPARQL construct notation into Turtle. As the name suggests, this could be done against the result tables we get from doing SPARQL queries; however it is more generally applicable to tabular data files.

A richer version might be able to output RDF/XML, but that would require use of libraries for SPARQL and RDF. This is of course close in spirit to D2RQ and SquirrelRDF and other relational/RDF mapping efforts, but the focus here is on something simple enough that we could include it in the top section of actual files.

The correct pronunciation btw is “el craps”. As in the card game

Righto, it’s about time I wrote this one up. One of my last deeds at W3C before leaving at the end of 2005, was to begin the specification of an XMPP binding of the SPARQL querying protocol. For the acronym averse, a quick recap. XMPP is the name the IETF give to the Jabber messaging technology. And SPARQL is W3C’s RDF-based approach to querying mixed-up Web data. SPARQL defines a textual query language, an XML result-set format, and a JSON version for good measure. There is also a protocol for interacting with SPARQL databases; this defines an abstract interface, and a binding to HTTP. There is as-yet no official binding to XMPP/Jabber, and existing explorations are flawed. But I’ll argue here, the work is well worth completing.

jqbus diagram

So what do we have so far? Back in 2005, I was working in Java, Chris Schmidt in Python, and Steve Harris in Perl. Chris has a nice writeup of one of the original variants I’d proposed, which came out of my discussions with Peter St Andre. Chris also beat me in the race to have a working implementation, though I’ll attribute this to Python’s advantages over Java ;)

I won’t get bogged down in the protocol details here, except to note that Peter advised us to use IQ stanzas. That existing work has a few slight variants on the idea of sending a SPARQL query in one IQ packet, and returning all the results within another, and that this isn’t quite deployable as-is. When the result set is too big, we can run into practical (rather than spec-mandated) limits at the server-to-server layers. For example, Peter mentioned that jabber.org had a 65k packet limit. At SGFoo last week, someone suggested sending the results as an attachment instead; apparently this one of the uncountably many extension specs produced by the energetic Jabber community. The 2005 work was also somewhat partial, and didn’t work out the detail of having a full binding (eg. dealing with default graphs, named graphs etc).

That said, I think we’re onto something good. I’ll talk through the Java stuff I worked on, since I know it best. The code uses Ignite Online’s Smack API. I have published rough Java code that can communicate with instances of itself across Jabber. This was last updated July 2007, when I fixed it up to use more recent versions of Smack and Jena. I forget if the code to parse out query results from the responses was completed, but it does at least send SPARQL XML results back through the XMPP network.

sparql jabber interaction

sparql jabber interaction

So why is this interesting?

  • SPARQLing over XMPP can cut through firewalls/NAT and talk to data on the desktop
  • SPARQLing over XMPP happens in a social environment; queries are sent from and to Jabber addresses, while roster information is available which could be used for access control at various levels of granularity
  • XMPP is well suited for async interaction; stored queries could return results days or weeks later (eg. job search)
  • The DISO project is integrating some PHP XMPP code with Wordpress; SparqlPress is doing same with SPARQL

Both SPARQL and XMPP have mechanisms for batched results, it isn’t clear which, if either, to use here.

XMPP also has some service discovery mechanisms; I hope we’ll wire up a way to inspect each party on a buddylist roster, to see who has SPARQL data available. I made a diagram of this last summer, but no code to go with it yet. There is also much work yet to do on access control systems for SPARQL data, eg. using oauth. It is far from clear how to integrate SPARQL-wide ideas on that with the specific possibilities offered within an XMPP binding. One idea is for SPARQL-defined FOAF groups to be used to manage buddylist rosters, “friend groups”.

Where are we with code? I have a longer page in FOAF SVN for the Jqbus Java stuff, and a variant on this writeup (includes better alt text for the images and more detail). The Java code is available for download. Chris’s Python code is still up on his site. I doubt any of these can currently talk to each other properly, but they show how to deal with XMPP in different languages, which is useful in itself. For Perl people, I’ve uploaded a copy of Steve’s code.

The Java stuff has a nice GUI for debugging, thanks to Smack. I just tried a copy. Basically I can run a client and a server instance from the same filetree, passing it my LiveJournal and Google Talk jabber account details. The screenshot here shows the client on the left having the XML-encoded SPARQL results, and the server on the right displaying the query that arrived. That’s about it really. Nothing else ought to be that different from normal SPARQL querying, except that it is being done in an infrastructure that is more socially-grounded and decentralised than the classic HTTP Web service model.

JQbus debug

Over the last month or so, I’ve had a SPARQL store (just an ARC instance plus a loader script) running on the Amazon EC2 sandbox I set up for FOAF experiments. Yesterday I installed a fresh SparqlPress bundle into my own blog, which runs on another server. So how to get the data across? Since SparqlPress includes a scutter, which despite Urban Dictionary is just the FOAF term for RDF crawler, we can set it to crawl a Web of linked RDF. Here’s a SPARQL query on the former installation which sets up some pointers to each graph (scutters often follow rdfs:seeAlso, amongst other techniques). It’s interesting too because it summarises the data found in each graph, simply by listing properties.

CONSTRUCT {
<http://danbri.org/foaf.rdf#danbri> <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?g .
?g <http://purl.org/net/scutter/vocabItem> ?p .
}
WHERE {
GRAPH ?g { ?s ?p ?o }
}

The sc:vocabItem property is a work of fiction; I’ve not searched around to see if similar things already exist. The idea of linking together medium-sized repositories of metadata is an old one. The Harvest crawling/indexing system did this with RDM/SOIF. In a similar vein was WHOIS++, a now-obsolete directory system that some of us tried using for resource discovery in the late 90s - collections of Web site descriptions federated through having each server syndicate a summary of its content (record types, field names, and tokenized field values). I hadn’t really thought of it this way until I wrote this query, but SparqlPress could in some ways support a reimplementation of the Harvest and WHOIS++ design, if we have gatherer and broker roles attached to a network of blogs. All these ideas are of course close to the P2P scene too: little datasets exchanging summaries of their content to aid query routing. The main point here is that we can run a pretty simple SPARQL query, and get a summary of the contents of various named data graphs; the results of the query in turn give a summary of a database that has copies of these graphs.

I’ve been using the SPARQL query language to access a very ad-hoc collection of personal and social graph data, and thanks to Bengee’s ARC system this can sit inside my otherwise ordinary Wordpress installation. At the moment, everything in there is public, but lately I’ve been discussing oauth with a few folk as a way of mediating access to selected subsets of that data. Which means the data store will need some way of categorising the dozens of misc data source URIs. There are a few ways to do this; here I try a slightly non-obvious approach.

Every SPARQL store can have many graphs inside, named by URI, plus optionally a default graph. The way I manage my store is a kind of structured chaos, with files crawled from links in my own data and my friends. One idea for indicating the structure of this chaos is to keep “table of contents” metadata in the default graph. For example, I might load up <http://danbri.org/foaf.rdf> into a SPARQL graph named with that URI. And I might load up <http://danbri.org/evilfoaf.rdf> into another graph, also using the retrieval URI to identify the data within my SPARQL store. Now, two points to make here: firstly, that the SPARQL spec does not mandate that we do things this way. An application that for example wanted to keep historical versions of a FOAF or RSS of schema document, could keep the triples from each version in a different named graph. Perhaps these might be named with UUIDs, for example. The second point, is that there are many different “meta” claims we want to store about our datasets. And that mixing them all into the store-wide “default graph” could be rather limiting, especially if we mightn’t want to unconditionally believe even those claims.

In the example above for example, I have data from running a PGP check against foaf.rdf (which passed) and evilfoaf.rdf (which doesn’t pass a check against my pgp identity). Now where do I store the results of this PGP checking? Perhaps the default graph, but maybe that’s messy. The idea I’m playing with here is that UUIDs are reasonable identifiers, and that perhaps we’ll find ourselves sharing common UUIDs across stores.

Go back to my sent-mail FOAF crawl example from yesterday. How far did I get? Well the end result was a list of URLs which I looped through, and loaded into my big chaotic SPARQL store. If I run the following query, I get a list of all the data graphs loaded:

SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o . } }

This reveals 54 URLs, basically everything I’ve loaded into ARC in the last month or so. Only 30 of these came from yesterday’s hack, which used Google’s new Social Graph API to allow me to map from hashed mailbox IDs to crawlable data URIs. So today’s game is to help me disentangle the 30 from the 54, and superimpose them on each other, but not always mixed with every other bit of information in the store. In other words, I’m looking for a flexible, query-based way of defining views into my personal data chaos.

So, what I tried. I took the result of yesterday’s hack, a file of data URIs called urls.txt. Then I modified my commandline dataloader script (yeah yeah this should be part of wordpress). My default data loader simply takes each URI, gets the data, and shoves it into the store under a graph name which was the URI used for retrieval. What I did today is, additionally, make a “table of contents” overview graph. To avoid worrying about names, I generated a UUID and used that. So there is a graph called <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> which contains simple asserts of the form:

<http://www.advogato.org/person/benadida/foaf.rdf> a <http://xmlns.com/foaf/0.1/Document> .
<http://www.w3c.es/Personal/Martin/foaf.rdf> a <http://xmlns.com/foaf/0.1/Document> .
<http://www.iandickinson.me.uk/rdf/foaf.rdf> a <http://xmlns.com/foaf/0.1/Document> . # etc

…for each of the 30 files my crawler loaded into the store.

This lets us use <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> as an indirection point for information related to this little mailbox crawler hack. I don’t have to “pollute” the single default graph with this data. And because the uuid: was previously meaningless, it is something we might decided makes sense to use across data visibility boundaries, ie. you might use the same UUID in your own SPARQL store, so we can share queries and app logic.

Here’s a simple query. It says, “ask the mailbox crawler table of contents graph (which we call uuid:320d9etc…) for all things it knows about that are a Document”. Then it says “ask each of those documents, for everything in it”. And then the SELECT clause returns all the property URIs. This gives a first level overview of what’s in the Web of data files found by the crawl. Query was:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?p WHERE {
GRAPH <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> { ?crawled a :Document . }
GRAPH ?crawled { ?s ?p ?o . }
}

ORDER BY ?p

I’ll just show the first page full of properties it found; for the rest see link to the complete set. Since W3C’s official SPARQL doesn’t have aggregates, we’d need to write application code (or use something like the SPARQL+ extensions) to get property usage counts. Here are some of the properties that were found in the data:

  http://kota.s12.xrea.com/vocab/uranaibloodtype
http://purl.org/dc/elements/1.1/creator
http://purl.org/dc/elements/1.1/description
http://purl.org/dc/elements/1.1/format
http://purl.org/dc/elements/1.1/title
http://purl.org/dc/terms/created
http://purl.org/dc/terms/modifed
http://purl.org/dc/terms/modified
http://purl.org/net/inkel/rdf/schemas/lang/1.1#masters
http://purl.org/net/inkel/rdf/schemas/lang/1.1#reads
http://purl.org/net/inkel/rdf/schemas/lang/1.1/masters
http://purl.org/net/inkel/rdf/schemas/lang/1.1/reads
http://purl.org/net/inkel/rdf/schemas/lang/1.1/speaks
http://purl.org/net/schemas/quaffing/drankBeerWith
http://purl.org/net/schemas/quaffing/drankLagerWith
http://purl.org/net/vocab/2004/07/visit#caregion
http://purl.org/net/vocab/2004/07/visit#country
http://purl.org/net/vocab/2004/07/visit#usstate
http://purl.org/ontology/mo/hasTrack
http://purl.org/ontology/mo/myspace
http://purl.org/ontology/mo/performed

So my little corner of the Web includes properties that extend FOAF documents to include blood types, countries that have been visited, language skills that people have, music information, and even drinking habits. But remember that this comes from my corner of the Web - people I’ve corresponded with - and probably isn’t indicative of the wider network. But that’s what grassroots decentralised data is all about. The folk who published this data didn’t need to ask permission of any committe to do so, they just mixed in what they wanted to say, alongside terms more widely used like foaf:Person, foaf:name. This is the way it should be: ask forgiveness, not permission, from the language lawyers and standardistas.

Ok, so let’s dig deeper into the messy data I crawled up from my sentmail contacts?

Here’s one that finds some photos, either using FOAF’s :img or :depiction properties:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT * WHERE {
GRAPH <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> { ?crawled a :Document . }
GRAPH ?crawled {
{ ?x :depiction ?y1 } UNION { ?x :img ?y2 } .
}
}

Here’s another that asks the crawl results for names and homepages it found:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT * WHERE { GRAPH <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> { ?crawled a :Document . }
GRAPH ?crawled { { [ :name ?n; :homepage ?h ] } }
}

To recap, the key point here is that social data in a SPARQL store will be rather chaotic. Information will often be missing, and often be extended. It will come from a variety of parties, some of whom you trust, some of whom you don’t know, and a few of whom will be actively malicious. Later down the line, subsets of the data will need different permissioning: if I export a family tree from ancestry.co.uk, I don’t want everyone to be able to do a SELECT for mother’s maiden name and my date of birth.

So what I suggest here, is that we can use UUID-named graphs as an organizing structure within an otherwise chaotic SPARQL environment. The demo here shows how one such graph can be used as a “table of contents” for other graphs associated with a particular app — in this case, the Google-mediated sentmail crawling app I made yesterday. Other named views might be: those data files from colleagues, those files that are plausibly PGP-signed, those that contain data structured according to some particular application need (eg. calendar, addressbook, photos, …).

If you’re interested in collaborating on Ruby tools for RDF, please join the public-rdf-ruby@w3.org mailing list at W3C. Just send a note to public-rdf-ruby-request@w3.org with a subject line of “subscribe”.

Last weekend I had the fortune to run into Rich Kilmer at O’Reilly’s ‘Social graph Foo Camp‘ gathering. In addition to helping decorate my tent, Rich told me a bit more about the very impressive Semitar RDF and OWL work he’d done in Ruby, initially as part of the DAML programme. Matt Biddulph was also there, and we discussed again what it would take to include FOAF import into Dopplr. I’d be really happy to see that, both because of Matt’s long history of contributions to the Semantic Web scene, but also because Dopplr and FOAF share a common purpose. I’ve long said that a purpose of FOAF is to engineer more coincidences in the world, and Dopplr comes from the same perspective: to increase serendipity.

Now, the thing about engineering serendipity, is that it doesn’t work without good information flow. And the thing about good information flow, is that it benefits from data models that don’t assume the world around us comes nicely parceled into cleanly distinct domains. Borrowing from American Splendor - “ordinary life is pretty complex stuff“. No single Web site, service, document format or startup is enough; the trick comes when you hook things together in unexpected combinations. And that’s just what we did in the RDF world: created a model for mixed up, cross-domain data sharing.

Dopplr, Tripit, Fire Eagle and other travel and location services may know where you and others are. Social network sites (and there are more every day) knows something of who you are, and something of who you care about. And the big G in the sky knows something of the parts of this story that are on the public record.

Data will always be spread around. RDF is a handy model for ad-hoc data merging from multiple sources. But you can’t do much without an RDF parser and a few other tools. Minimally, an RDF/XML parser and a basic API for navigating the graph. There are many more things you could add. In my old RubyRdf work, I had in-memory and SQL-backed storage, with a Squish query interface to each. I had a donated RDF/XML parser (from Ruby4R) and a much-improved query engine (with support for optionals) from Damian Steer. But the system is code-rotted. I wrote it when I was learning Ruby beginning 7 years ago, and I think it is “one to throw away”. I’m really glad I took the time to declare that project “closed” so as to avoid discouraging others, but it is time to revisit Ruby and RDF again now.

Other tools have other offerings: Dave Beckett’s Redland system (written in C) ships with a Ruby wrapper. Dave’s tools probably have the best RDF parsing facilities around, are fast, but require native code. Rena is a pure Ruby library, which looked like a great start but doesn’t appear to have been developed further in recent years.

I could continue going through the list of libraries, but Paul Stadig has already done a great job of this recently (see also his conclusions, which make perfect sense). There has been a lot of creative work around RDF/RDFS and OWL in Ruby, and collectively we clearly have a lot of talent and code here. But collectively we also lack a finished product. It is a real shame when even an RDF-enthusiast like Matt Biddulph is not in a position to simply “gem install” enough RDF technology to get a simple job done. Let’s get this fixed. As I said above,

If you’re interested in collaborating on Ruby tools for RDF, please join the public-rdf-ruby@w3.org mailing list at W3C. Just send a note to public-rdf-ruby-request@w3.org with a subject line of “subscribe”.

In six months time, I’d like to see at least one solid, well rounded and modern RDF toolkit packaged as a Gem for the Ruby community. It should be able to parse RDF/XML flawlessy, and in addition to the usual unit tests, it should be wired up to the RDF Test Cases (see download) so we can all be assured it is robust. It should allow for a fast C parser such as Raptor to be used if available, falling back on pure Ruby otherwise. There should be a basic API that allows me to navigate an RDF graph of properties and values using clear, idiomatic Ruby. Where available, it should hook up to external stores of data, and include at least a SPARQL protocol client, eventually a full SPARQL implementation. It should allow multiple graphs to be super-imposed and disentangled. Some support for RDFS, OWL and rule languages would be a big plus. Support for other notations such as Turtle, RDFa, or XSLT-based GRDDL transforms would be useful, as would a plugin for microformat import. Transliterating Python code (such as the tiny Euler rule engine) should be considered. Divergence from existing APIs in Python (and Perl, Javascript, PHP etc) should be minimised, and carefully balanced against the pull of the Ruby way. And (thought I lack strong views here) it should be made available under a liberal opensource license that permits redistribution under GPL. It should also be as I18N and Unicode-friendly as is possible in Ruby these days.

I’m not saying that all RDF toolkits should be merged, or that collaboration is compulsory. But we are perilously fragmented right now, and collaboration can be fun. In six months time, people who simply want to use RDF from Ruby ought to be pleasantly suprised rather than frustrated when they take to the ‘net to see what’s out there. If it takes a year instead of six months, sure whatever. But not seven years! It would be great to see some movement again towards a common library…

How hard can it be?

I’ve lately started writing up and prototyping around a use-case for the “Group” construct in FOAF and for medium-sized, partially private data aggregators like SparqlPress. I think we can do something interesting to deal with the social pressure and information load people are experiencing on sites like Flickr and Twitter.

Often people have rather large lists of friends, contacts or buddys - publically visible lists - which play a variety of roles in their online life. Sometimes these roles are in tension. Flickr, for example, allow their users to mark some of their contacts as “friend” or “family” (or both). Real life isn’t so simple, however. And the fact that this classification is shared (in Flickr’s case with everyone) makes for a potentially awkward dynamic. Do you really want to tell everyone you know whether they are a full “friend” or a mere “contact”? Let alone keep this information up to date on every social site that hosts it. I’ve heard a good few folk complain about the stress of adding yet another entry to their list of “twitter follows” people.  Their lists are often already huge through some sense of social obligation to reciprocate. I think it’s worth exploring some alternative filtering and grouping mechanisms.

On the one hand, we have people “bookmarking” people they find interesting, or want to stay in touch with, or “get to know better”. On the other, we have the bookmarked party sometimes reciprocating those actions because they feel it polite; a situation complicated by crude categories like “friend” versus “contact”. What makes this a particularly troublesome combination is when user-facing features, such as “updates from your buddies” or “photos from your friends/contacts” are built on top of these buddylists.

Take my own case on Flickr, I’m probably not typical, but I’m a use case I care about. I have made no real consistent use of the “friend” versus “contact” distinction; to even attempt do so would be massively time consuming, and also a bit silly. There are all kinds of great people on my Flickr contacts list. Some I know really well, some I barely know but admire their photography, etc. It seems currently my Flickr account has 7 “family”, 79 “friends” and 604 “contacts”.

Now to be clear, I’m not grumbling about Flickr. Flickr is a work of genius, no nitpicking. What I ask for is something that goes beyond what Flickr alone can do. I’d like a better way of seeing updates from friends and contacts. This is just a specific case of a general thing (eg. RSS/Atom feed management), but let’s stay specific for now.

Currently, my Flickr email notification settings are:

  • When people comment on your photos: Yes
  • When your friends and family upload new photos: Yes (daily digest)
  • When your other contacts upload new photos: No

What this means is that I selected to see notifications photos from those 86 people who I haveflagged as “friend” or “family”. And I chose not to be notified of new photos from the other 604 contacts. Even though that list contains many people I know, and would like to know better. The usability question here is: how can we offer more subtlety in this mechanism, without overwhelming users with detail? My guess is that we approach this by collecting in a more personal environment some information that users might not want to state publically in an online profile. So a desktop (or weblog, e.g. SparqlPress) aggregation of information from addressbooks, email send/receive patterns, weblog commenting behaviours, machine readable resumes, … real evidence of real connections between real people.

And so I’ve been mining around for sources of “foaf:Group” data. As a work in progress testbed I have a list of OpenIDs of people whose comments I’ve approved in my blog. And another, of people who have worked on the FOAF wiki. I’ve been looking at the machine readable data from W3C’s Tech Reports digital archive too, as that provides information about a network of collaboration going back to the early days of the Web (and it’s available in RDF). I also want to integrate my sent-mail logs, to generate another group, and extract more still from my addressbook. On top of this of course I have access to the usual pile of FOAF and XFN, scraped or API-extracted information from social network sites and IM accounts. Rather than publish it all for the world to see, the idea here is to use it to generate simple personal-use user interfaces couched in terms of these groups. So, hopefully I shall be able to use those groups as a filter against the 600+ members of my flickr buddylist, and make some kind of basic RSS-filtered view of their photos.

If this succeeds, it may help people to have huge buddylists without feeling they’ve betrayed their “real friends” by doing so, or forcing them to put their friends into crudely labelled public buckets marked “real friend” and “mere contact”. The core FOAF design principle here is our longstanding bias towards evidence-based rather than proclaimed information.  Don’t just ask people to tell you who their friends are, and how close the relationship is (especially in public, and most especially in a format that can be imported into spreadsheets for analysis). Instead… take the approach of  “don’t say it, show it“. Create tools based on the evidence that friendship and collaboration leaves in the Web. The public Web, but also the bits that don’t show up in Google, and hopefully never will. If we can find a way to combine all that behind a simple UI, it might go a little way towards “waving not drowning” in information.

Much of this is stating the obvious, but I thought I’d leave the nerdly details of SPARQL and FOAF/RDF for another post.

Instead here’s a pointer to a Scoble article on Mr Facebook, where they’re grappling with similar issues but from a massive aggregation rather than decentralised perspective:

Facebook has a limitation on the number of friends a person can have, which is 4,999 friends. He says that was due to scaling/technical issues and they are working on getting rid of that limitation. Partly to make it possible to have celebrities on Facebook but also to make it possible to have more grouping kinds of features. He told me many members were getting thousands of friends and that those users are also asking for many more features to put their friends into different groups. He expects to see improvements in that area this year.

ps. anyone know if Facebook group membership data is accessible through their APIs? I know I can get group data from LiveJournal and from My Opera in FOAF, but would be cool to mix in Facebook too.

Next Page »