Graph URIs in SPARQL: Using UUIDs as named views

I’ve been using the SPARQL query language to access a very ad-hoc collection of personal and social graph data, and thanks to Bengee’s ARC system this can sit inside my otherwise ordinary WordPress installation. At the moment, everything in there is public, but lately I’ve been discussing oauth with a few folk as a way of mediating access to selected subsets of that data. Which means the data store will need some way of categorising the dozens of misc data source URIs. There are a few ways to do this; here I try a slightly non-obvious approach.

Every SPARQL store can have many graphs inside, named by URI, plus optionally a default graph. The way I manage my store is a kind of structured chaos, with files crawled from links in my own data and my friends’. One idea for indicating the structure of this chaos is to keep “table of contents” metadata in the default graph. For example, I might load up <http://danbri.org/foaf.rdf> into a SPARQL graph named with that URI. And I might load up <http://danbri.org/evilfoaf.rdf> into another graph, also using the retrieval URI to identify the data within my SPARQL store. Now, two points to make here: firstly, that the SPARQL spec does not mandate that we do things this way. An application that, for example, wanted to keep historical versions of a FOAF or RSS schema document could keep the triples from each version in a different named graph. Perhaps these might be named with UUIDs, for example. The second point is that there are many different “meta” claims we want to store about our datasets, and that mixing them all into the store-wide “default graph” could be rather limiting, especially if we mightn’t want to unconditionally believe even those claims.

In the example above, for instance, I have data from running a PGP check against foaf.rdf (which passed) and evilfoaf.rdf (which doesn’t pass a check against my PGP identity). Now where do I store the results of this PGP checking? Perhaps the default graph, but maybe that’s messy. The idea I’m playing with here is that UUIDs are reasonable identifiers, and that perhaps we’ll find ourselves sharing common UUIDs across stores.

Go back to my sent-mail FOAF crawl example from yesterday. How far did I get? Well the end result was a list of URLs which I looped through, and loaded into my big chaotic SPARQL store. If I run the following query, I get a list of all the data graphs loaded:

SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o . } }

This reveals 54 URLs, basically everything I’ve loaded into ARC in the last month or so. Only 30 of these came from yesterday’s hack, which used Google’s new Social Graph API to allow me to map from hashed mailbox IDs to crawlable data URIs. So today’s game is to help me disentangle the 30 from the 54, and superimpose them on each other, but not always mixed with every other bit of information in the store. In other words, I’m looking for a flexible, query-based way of defining views into my personal data chaos.

So, what I tried. I took the result of yesterday’s hack, a file of data URIs called urls.txt. Then I modified my commandline dataloader script (yeah yeah, this should be part of WordPress). My default data loader simply takes each URI, gets the data, and shoves it into the store under a graph name which was the URI used for retrieval. What I did today is, additionally, make a “table of contents” overview graph. To avoid worrying about names, I generated a UUID and used that. So there is a graph called <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> which contains simple assertions of the form:

<http://www.advogato.org/person/benadida/foaf.rdf> a <http://xmlns.com/foaf/0.1/Document> .
<http://www.w3c.es/Personal/Martin/foaf.rdf> a <http://xmlns.com/foaf/0.1/Document> .
<http://www.iandickinson.me.uk/rdf/foaf.rdf> a <http://xmlns.com/foaf/0.1/Document> . # etc

…for each of the 30 files my crawler loaded into the store.
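
For the curious, here’s roughly the shape of that loader tweak as a standalone Ruby sketch. It only shows the table-of-contents side (the actual loading into the ARC store happens separately), and the output filename, plus generating a fresh UUID rather than reusing the one above, are just illustrative choices of mine:

require 'securerandom'

FOAF     = 'http://xmlns.com/foaf/0.1/'
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'

# The crawl output from yesterday: one data URI per line.
urls = File.readlines('urls.txt').map(&:strip).reject(&:empty?)

# A fresh UUID names the "table of contents" graph for this crawl.
toc_graph = "uuid:#{SecureRandom.uuid}"

# One simple assertion per crawled document; the resulting N-Triples
# file then gets loaded into the store under the UUID graph name.
File.open('toc.nt', 'w') do |out|
  urls.each do |u|
    out.puts "<#{u}> <#{RDF_TYPE}> <#{FOAF}Document> ."
  end
end

puts "load toc.nt into the graph named <#{toc_graph}>"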

This lets us use <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> as an indirection point for information related to this little mailbox crawler hack. I don’t have to “pollute” the single default graph with this data. And because the uuid: was previously meaningless, it is something we might decide makes sense to use across data visibility boundaries, ie. you might use the same UUID in your own SPARQL store, so we can share queries and app logic.

Here’s a simple query. It says, “ask the mailbox crawler table of contents graph (which we call uuid:420d9etc…) for all things it knows about that are a Document”. Then it says “ask each of those documents, for everything in it”. And then the SELECT clause returns all the property URIs. This gives a first-level overview of what’s in the Web of data files found by the crawl. The query was:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?p WHERE {
  GRAPH <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> { ?crawled a :Document . }
  GRAPH ?crawled { ?s ?p ?o . }
}
ORDER BY ?p

I’ll just show the first page full of properties it found; for the rest, see the link to the complete set. Since W3C’s official SPARQL doesn’t have aggregates, we’d need to write application code (or use something like the SPARQL+ extensions) to get property usage counts; a rough sketch of that application-code approach follows the list. Here are some of the properties that were found in the data:

http://kota.s12.xrea.com/vocab/uranaibloodtype
http://purl.org/dc/elements/1.1/creator
http://purl.org/dc/elements/1.1/description
http://purl.org/dc/elements/1.1/format
http://purl.org/dc/elements/1.1/title
http://purl.org/dc/terms/created
http://purl.org/dc/terms/modifed
http://purl.org/dc/terms/modified
http://purl.org/net/inkel/rdf/schemas/lang/1.1#masters
http://purl.org/net/inkel/rdf/schemas/lang/1.1#reads
http://purl.org/net/inkel/rdf/schemas/lang/1.1/masters
http://purl.org/net/inkel/rdf/schemas/lang/1.1/reads
http://purl.org/net/inkel/rdf/schemas/lang/1.1/speaks
http://purl.org/net/schemas/quaffing/drankBeerWith
http://purl.org/net/schemas/quaffing/drankLagerWith
http://purl.org/net/vocab/2004/07/visit#caregion
http://purl.org/net/vocab/2004/07/visit#country
http://purl.org/net/vocab/2004/07/visit#usstate
http://purl.org/ontology/mo/hasTrack
http://purl.org/ontology/mo/myspace
http://purl.org/ontology/mo/performed
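
Here’s the sort of application-side counting I mean: a rough Ruby sketch that re-runs the query above (without DISTINCT, so every matching triple produces a row) over the SPARQL protocol and tallies the bindings client-side. The endpoint URL and the format parameter are assumptions about my local ARC setup, not anything you should take as standard:

require 'net/http'
require 'uri'
require 'json'

# Hypothetical SPARQL protocol endpoint for the SparqlPress/ARC store.
ENDPOINT = 'http://danbri.org/sparql'
TOC      = 'uuid:420d9490-d73f-11dc-95ff-0800200c9a66'

query = <<SPARQL
PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT ?p WHERE {
  GRAPH <#{TOC}> { ?crawled a :Document . }
  GRAPH ?crawled { ?s ?p ?o . }
}
SPARQL

uri = URI(ENDPOINT)
uri.query = URI.encode_www_form(query: query,
                                format: 'application/sparql-results+json')
results = JSON.parse(Net::HTTP.get(uri))

# Tally property usage client-side, since plain SPARQL lacks COUNT.
counts = Hash.new(0)
results['results']['bindings'].each { |row| counts[row['p']['value']] += 1 }

counts.sort_by { |_, n| -n }.each { |p, n| puts "#{n}\t#{p}" }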

So my little corner of the Web includes properties that extend FOAF documents to include blood types, countries that have been visited, language skills that people have, music information, and even drinking habits. But remember that this comes from my corner of the Web – people I’ve corresponded with – and probably isn’t indicative of the wider network. But that’s what grassroots decentralised data is all about. The folk who published this data didn’t need to ask permission of any committee to do so; they just mixed in what they wanted to say, alongside more widely used terms like foaf:Person and foaf:name. This is the way it should be: ask forgiveness, not permission, from the language lawyers and standardistas.

Ok, so let’s dig deeper into the messy data I crawled up from my sent-mail contacts.

Here’s one that finds some photos, either using FOAF’s :img or :depiction properties:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT * WHERE {
  GRAPH <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> { ?crawled a :Document . }
  GRAPH ?crawled {
    { ?x :depiction ?y1 } UNION { ?x :img ?y2 }
  }
}

Here’s another that asks the crawl results for names and homepages it found:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT * WHERE {
  GRAPH <uuid:420d9490-d73f-11dc-95ff-0800200c9a66> { ?crawled a :Document . }
  GRAPH ?crawled { [ :name ?n; :homepage ?h ] }
}

To recap, the key point here is that social data in a SPARQL store will be rather chaotic. Information will often be missing, and often be extended. It will come from a variety of parties, some of whom you trust, some of whom you don’t know, and a few of whom will be actively malicious. Later down the line, subsets of the data will need different permissioning: if I export a family tree from ancestry.co.uk, I don’t want everyone to be able to do a SELECT for mother’s maiden name and my date of birth.

So what I suggest here, is that we can use UUID-named graphs as an organizing structure within an otherwise chaotic SPARQL environment. The demo here shows how one such graph can be used as a “table of contents” for other graphs associated with a particular app — in this case, the Google-mediated sentmail crawling app I made yesterday. Other named views might be: those data files from colleagues, those files that are plausibly PGP-signed, those that contain data structured according to some particular application need (eg. calendar, addressbook, photos, …).

Outbox FOAF crawl: using the Google SG API to bring context to email

OK here’s a quick app I hacked up on my laptop largely in the back of a car, ie. took ~1 hour to get from idea to dataset.

The idea is the usual FOAFy schtick of taking an evidential rather than purely claim-based approach to the ‘social graph’.

We take the list of people I’ve emailed, and we probe the social data Web to scoop up user profile etc descriptions of them. Whether this is useful will depend upon how much information we can scoop up, of course. Eventually oauth or similar technology should come into play, for accessing non-public information. For now, we’re limited to data shared in the public Web.

Here’s what I did:

  1. I took the ‘sent mail’ folder on my laptop (a Mozilla Thunderbird installation).
  2. I built a list of all the email addresses I had sent mail to (in Cc: or To: fields). Yes, I grepped.
    • working assumption: these are humans I’m interacting with (ie. that I “foaf:know”)
    • I could also have looked for X-FOAF: headers in mail from them, but this is not widely deployed
    • I threw away anything that matched some company-related mail hosts I wanted to exclude.
  3. For each email address, scrubbed of noise, I prepend mailto: and generate a SHA1 value (see the sketch just after this list).
  4. The current Google SG API can be consulted about hashed mailboxes using URIs in the unregistered sgn: space, for example my hashed gmail address (danbrickley@gmail.com) gives sgn://mboxsha1/?pk=6e80d02de4cb3376605a34976e31188bb16180d0 … which can be dropped into the Google parameter playground. I’m talking with Brad Fitzpatrick about how best these URIs might be represented without inventing a new URI scheme, btw.
  5. Google returns a pile of information, indicating connections between URLs, mailboxes, hashes etc., whether verified or merely claimed.
  6. I crudely ignore all that, and simply parse out anything beginning with http:// to give me some starting points to crawl.
  7. I feed these to an RDF FOAF/XFN crawler attached to my SparqlPress-augmented WordPress blog.
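
Step 3 is the only fiddly bit, and in Ruby it is essentially a one-liner per address. A minimal sketch (the example address is the one from the post above; the helper name is just mine):

require 'digest/sha1'

# Step 3: foaf:mbox_sha1sum style hashing - prepend mailto: and SHA1 the result.
def mbox_sha1sum(email)
  Digest::SHA1.hexdigest('mailto:' + email.strip)
end

# The post's own example address; the hash should match the sgn: URI above.
puts mbox_sha1sum('danbrickley@gmail.com')
# => 6e80d02de4cb3376605a34976e31188bb16180d0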

With no optimisation, polish or testing, here (below) is the list of documents this process gave me. I have not yet investigated their contents.

The goal here is to find out more about the people who are filling my mailbox, so that better UI (eg. auto-sorting into folders, photos, task-clustering etc) could help make email more contextualised and sanity-preserving. The next step I think is to come up with a grouping and permissioning model for this crawled FOAF/RDF in the WordPress SPARQL store, ie. a way of asking queries across all social graph data in the store that came from this particular workflow. My store has other semi-personal data, such as the results of crawling the groups of people who have successfully commented in blogs and wikis that I run. I’m also looking at a Venn-diagram UI for presenting these different groups in a way that allows other groups (eg. “anyone I send mail to OR who commented in my wiki”) to be defined in terms of these evidence-driven primitive sets. But that’s another story.

FOAF files located with the help of the Google API:

http://www.advogato.org/person/benadida/foaf.rdf
http://www.w3c.es/Personal/Martin/foaf.rdf
http://www.iandickinson.me.uk/rdf/foaf.rdf
http://people.tribe.net/xe0s242
http://www.geocities.com/rmarkwhite/foaf.xml
http://www.l3s.de/~diederich/foaf.rdf
http://www.w3.org/People/maxf/foaf
http://revyu.com/people/mhausenblas/about/rdf
http://www.cs.vu.nl/~laroyo/foaf.rdf
http://wwwis.win.tue.nl/~nstash/foaf.rdf
http://www.nimbustier.net/nimbustier/foaf.rdf
http://www.cubicgarden.com/webdav/profile/foaf.rdf
http://www.ibiblio.org/hhalpin/foaf.rdf
http://www.kjetil.kjernsmo.net/linkedin-contacts.rdf
http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user\u003ddino
http://www.kjetil.kjernsmo.net/friends.rdf
http://www.wikier.org/foaf.rdf
http://www.tribe.net/FOAF/4916e791-9793-40c9-875b-21cb25733a32
http://www.zonk.net/ghard/foaf.rdf
http://www.zonk.net/ghard/foaf.rdf
http://www.kjetil.kjernsmo.net/friends.rdf
http://www.kjetil.kjernsmo.net/linkedin-contacts.rdf
http://www.gnowsis.com/leo/foaf.xml
http://people.tribe.net/c09622fd-c817-4935-a039-5cb5d0d6bed8
http://www.few.vu.nl/~wrvhage/foaf.rdf
http://www.zonk.net/ghard/foaf.rdf
http://www.lri.fr/~pietriga/foaf.rdf
http://www.advogato.org/person/Pike/foaf.rdf
http://www.deri.ie/fileadmin/scripts/foaf.php?id\u003d12

For a quick hack, I’m really pleased with this. The sent-mail folder I used isn’t even my only one. A rewrite of the script over IMAP (and proprietary mail host APIs) could be quite fun.

RDF in Ruby revisited

If you’re interested in collaborating on Ruby tools for RDF, please join the public-rdf-ruby@w3.org mailing list at W3C. Just send a note to public-rdf-ruby-request@w3.org with a subject line of “subscribe”.

Last weekend I had the good fortune to run into Rich Kilmer at O’Reilly’s ‘Social Graph Foo Camp’ gathering. In addition to helping decorate my tent, Rich told me a bit more about the very impressive Semitar RDF and OWL work he’d done in Ruby, initially as part of the DAML programme. Matt Biddulph was also there, and we discussed again what it would take to include FOAF import into Dopplr. I’d be really happy to see that, both because of Matt’s long history of contributions to the Semantic Web scene, and because Dopplr and FOAF share a common purpose. I’ve long said that a purpose of FOAF is to engineer more coincidences in the world, and Dopplr comes from the same perspective: to increase serendipity.

Now, the thing about engineering serendipity is that it doesn’t work without good information flow. And the thing about good information flow is that it benefits from data models that don’t assume the world around us comes nicely parceled into cleanly distinct domains. Borrowing from American Splendor – “ordinary life is pretty complex stuff”. No single Web site, service, document format or startup is enough; the trick comes when you hook things together in unexpected combinations. And that’s just what we did in the RDF world: created a model for mixed-up, cross-domain data sharing.

Dopplr, Tripit, Fire Eagle and other travel and location services may know where you and others are. Social network sites (and there are more every day) know something of who you are, and something of who you care about. And the big G in the sky knows something of the parts of this story that are on the public record.

Data will always be spread around. RDF is a handy model for ad-hoc data merging from multiple sources. But you can’t do much without an RDF parser and a few other tools. Minimally, an RDF/XML parser and a basic API for navigating the graph. There are many more things you could add. In my old RubyRdf work, I had in-memory and SQL-backed storage, with a Squish query interface to each. I had a donated RDF/XML parser (from Ruby4R) and a much-improved query engine (with support for optionals) from Damian Steer. But the system is code-rotted. I wrote it when I was learning Ruby beginning 7 years ago, and I think it is “one to throw away”. I’m really glad I took the time to declare that project “closed” so as to avoid discouraging others, but it is time to revisit Ruby and RDF again now.

Other tools have other offerings: Dave Beckett’s Redland system (written in C) ships with a Ruby wrapper. Dave’s tools probably have the best RDF parsing facilities around, are fast, but require native code. Rena is a pure Ruby library, which looked like a great start but doesn’t appear to have been developed further in recent years.

I could continue going through the list of libraries, but Paul Stadig has already done a great job of this recently (see also his conclusions, which make perfect sense). There has been a lot of creative work around RDF/RDFS and OWL in Ruby, and collectively we clearly have a lot of talent and code here. But collectively we also lack a finished product. It is a real shame when even an RDF-enthusiast like Matt Biddulph is not in a position to simply “gem install” enough RDF technology to get a simple job done. Let’s get this fixed. As I said above,

If you’re interested in collaborating on Ruby tools for RDF, please join the public-rdf-ruby@w3.org mailing list at W3C. Just send a note to public-rdf-ruby-request@w3.org with a subject line of “subscribe”.

In six months time, I’d like to see at least one solid, well rounded and modern RDF toolkit packaged as a Gem for the Ruby community. It should be able to parse RDF/XML flawlessly, and in addition to the usual unit tests, it should be wired up to the RDF Test Cases (see download) so we can all be assured it is robust. It should allow for a fast C parser such as Raptor to be used if available, falling back on pure Ruby otherwise. There should be a basic API that allows me to navigate an RDF graph of properties and values using clear, idiomatic Ruby. Where available, it should hook up to external stores of data, and include at least a SPARQL protocol client, eventually a full SPARQL implementation. It should allow multiple graphs to be superimposed and disentangled. Some support for RDFS, OWL and rule languages would be a big plus. Support for other notations such as Turtle, RDFa, or XSLT-based GRDDL transforms would be useful, as would a plugin for microformat import. Transliterating Python code (such as the tiny Euler rule engine) should be considered. Divergence from existing APIs in Python (and Perl, Javascript, PHP etc) should be minimised, and carefully balanced against the pull of the Ruby way. And (though I lack strong views here) it should be made available under a liberal opensource license that permits redistribution under GPL. It should also be as I18N and Unicode-friendly as is possible in Ruby these days.
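
To make the “basic API” wish a little more concrete, here’s a purely illustrative toy (not any existing gem, and nothing like a finished design): a few lines of plain Ruby sketching the kind of idiomatic property/value navigation I mean.

FOAF = 'http://xmlns.com/foaf/0.1/'

# A toy in-memory triple store, just to sketch the navigation style.
class TinyGraph
  def initialize
    @triples = []
  end

  def add(s, p, o)
    @triples << [s, p, o]
    self
  end

  # All objects o such that (s, p, o) has been asserted.
  def objects(s, p)
    @triples.select { |ts, tp, _| ts == s && tp == p }.map { |t| t[2] }
  end
end

g = TinyGraph.new
g.add('#danbri', FOAF + 'name', 'Dan Brickley')
g.add('#danbri', FOAF + 'knows', '#libby')

puts g.objects('#danbri', FOAF + 'name').first   # => Dan Brickley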

I’m not saying that all RDF toolkits should be merged, or that collaboration is compulsory. But we are perilously fragmented right now, and collaboration can be fun. In six months time, people who simply want to use RDF from Ruby ought to be pleasantly surprised rather than frustrated when they take to the ‘net to see what’s out there. If it takes a year instead of six months, sure whatever. But not seven years! It would be great to see some movement again towards a common library…

How hard can it be?

Google Social Graph API, privacy and the public record

I’m digesting some of the reactions to Google’s recently announced Social Graph API. ReadWriteWeb ask whether this is a creeping privacy violation, and danah boyd has a thoughtful post raising concerns about whether the privileged tech elite have any right to experiment in this way with the online lives of those who lack status and knowledge of these obscure technologies, and who may be amongst the more vulnerable users of the social Web.

While I tend to agree with Tim O’Reilly that privacy by obscurity is dead, I’m not of the “privacy is dead, get over it” school of thought. Tim argues,

The counter-argument is that all this data is available anyway, and that by making it more visible, we raise people’s awareness and ultimately their behavior. I’m in the latter camp. It’s a lot like the evolutionary value of pain. Search creates feedback loops that allow us to learn from and modify our behavior. A false sense of security helps bad actors more than tools that make information more visible.

There’s a danger here of technologists seeming to blame the very people we’re causing pain to. As danah says, “Think about whistle blowers, women or queer folk in repressive societies, journalists, etc.”. Not everyone knows their DTD from their TCP, or understands anything of how search engines, HTML or hyperlinks work. And many folk have more urgent things to focus on than learning such obscurities, let alone understanding the practical privacy, safety and reputation-related implications of their technology-mediated deeds.

Web technologists have responsibilities to the users of the Web, and while media education and literacy are important, those who are shaping and re-shaping the Web ought to be spending serious time on a daily basis struggling to come up with better ways of allowing humans to act and interact online without other parties snooping. The end of privacy by obscurity should not mean the death of privacy.

Privacy is not dead, and we will not get over it.

But it does need to be understood in the context of the public record. The reason I am enthusiastic about the Google work is that it shines a big bright light on the things people are currently putting into the public record. And it does so in a way that should allow people to build better online environments for those who do want their public actions visible, while providing immediate – and sometimes painful – feedback to those who have over-exposed themselves in the Web, and wish to backpedal.

I hope Google can put a user support mechanism on this. I know from our experience in the FOAF community that, even with small-scale and obscure aggregators, people will find themselves and demand to be “taken down”. While any particular aggregator can remove or hide such data, unless the data is tracked back to its source, it’ll crop up elsewhere in the Web.

I think the argument that FOAF and XFN are particularly special here is a big mistake. Web technologies used correctly (posh – “plain old semantic html” in microformats-speak) already facilitate such techniques. And Google is far from the only search engine in existence. Short of obfuscating all text inside images, personal data from these sites is readily harvestable.

ReadWriteWeb comment:

None the less, apparently the absence of XFN/FOAF data in your social network is no assurance that it won’t be pulled into the new Google API, either. The Google API page says “we currently index the public Web for XHTML Friends Network (XFN), Friend of a Friend (FOAF) markup and other publicly declared connections.” In other words, it’s not opt-in by even publishers – they aren’t required to make their information available in marked-up code.

The Web itself is built from marked-up code, and this is a thing of huge benefit to humanity. Both microformats and the Semantic Web community share the perspective that the Web’s core technologies (HTML, XHTML, XML, URIs) are properly consumed both by machines and by humans, and that any efforts to create documents that are usable only by (certain fortunate) humans is anti-social and discriminatory.

The Web Accessibility movement have worked incredibly hard over many years to encourage Web designers to create well marked up pages, where the meaning of the content is as mechanically evident as possible. The more evident the meaning of a document, the easier it is to repurpose it or present it through alternate means. This goal of device-independent, well marked up Web content is one that unites the accessibility, Mobile Web, Web 2.0, microformat and Semantic Web efforts. Perhaps the most obvious case is for blind and partially sighted users, but good markup can also benefit those with the inability to use a mouse or keyboard. Beyond accessibility, many millions of Web users (many poor, and in poor countries) will have access to the Web only via mobile phones. My former employer W3C has just published a draft document, “Experiences Shared by People with Disabilities and by People Using Mobile Devices”. Last month in Bangalore, W3C held a Workshop on the Mobile Web in Developing Countries (see executive summary).

I read both Tim’s post, and danah’s post, and I agree with large parts of what they’re both saying. But not quite with either of them, so all I can think to do is spell out some of my perhaps previously unarticulated assumptions.

  • There is no huge difference in principle between “normal” HTML Web pages and XFN or FOAF. Textual markup is what the Web is built from.
  • FOAF and XFN take some of the guesswork out of interpreting markup. But other technologies (javascript, perl, XSLT/GRDDL) can also transform vague markup into more machine-friendly markup. FOAF/XFN simply make this process easier and less heuristic, less error prone.
  • Google was not the first search engine, it is not the only search engine, and it will not be the last search engine. To obsess on Google’s behaviour here is to mistake Google for the Web.
  • Deeds that are on the public record in the Web may come to light months or years later; Google’s opening up of the (already public, but fragmented) Usenet historical record is a good example here.
  • Arguing against good markup practice on the Web (accessible, device independent markup) is something that may hurt underprivileged users (with disabilities, or limited access via mobile, high bandwidth costs etc).
  • Good markup allows content to be automatically summarised and re-presented to suit a variety of means of interaction and navigation (eg. voice browsers, screen readers, small screens, non-mouse navigation etc).
  • Good markup also makes it possible for search engines, crawlers and aggregators to offer richer services.

The difference between Google crawling FOAF/XFN from LiveJournal, versus extracting similar information via custom scripts from MySpace, is interesting and important solely to geeks. Mainstream users have no idea of such distinctions. When LiveJournal originally launched their FOAF files in 2004, the rule they followed was a pretty sensible one: if the information was there in the HTML pages, they’d also expose it in FOAF.

We need to be careful of taking a ruthless “you can’t make an omelette without breaking eggs” line here. Whatever we do, people will suffer. If the Web is made inaccessible, with information hidden inside image files or otherwise obfuscated, we exclude a huge constituency of users. If we shine a light on the public record, as Google have done, we’ll embarrass, expose and even potentially risk harm to the people described by these interlinked documents. And if we stick our head in the sand and pretend that these folk aren’t exposed, I predict this will come back to bite us in the butt in a few months or years, since all that data is out there, being crawled, indexed and analysed by parties other than Google. Parties with less to lose, and more to gain.

So what to do? I think several activities need to happen in parallel:

  • Best practice codes for those who expose, and those who aggregate, social Web data
  • Improved media literacy education for those who are unwittingly exposing too much of themselves online
  • Technology development around decentralised, non-public record communication and community tools (eg. via Jabber/XMPP)

Any search engine at all, today, is capable of supporting the following bit of mischief:

Take as a starting point a collection of user profiles on a public site. Extract all the usernames. Find the ones that appear in the Web fewer than, say, 10,000 times, and on other sites. Assume these are unique userIDs and crawl the pages they appear in, do some heuristic name matching, … and you’ll have a pile of smushed identities, perhaps linking professional and dating sites, or drunken college photos to respectable-new-life. No FOAF needed.

The answer I think isn’t to beat up on the aggregators, it’s to improve the Web experience such that people can have real privacy when they need it, rather than the misleading illusion of privacy. This isn’t going to be easy, but I don’t see a credible alternative.

Waving not Drowning? groups as buddylist filters

I’ve lately started writing up and prototyping around a use-case for the “Group” construct in FOAF and for medium-sized, partially private data aggregators like SparqlPress. I think we can do something interesting to deal with the social pressure and information load people are experiencing on sites like Flickr and Twitter.

Often people have rather large lists of friends, contacts or buddies – publicly visible lists – which play a variety of roles in their online life. Sometimes these roles are in tension. Flickr, for example, allow their users to mark some of their contacts as “friend” or “family” (or both). Real life isn’t so simple, however. And the fact that this classification is shared (in Flickr’s case with everyone) makes for a potentially awkward dynamic. Do you really want to tell everyone you know whether they are a full “friend” or a mere “contact”? Let alone keep this information up to date on every social site that hosts it. I’ve heard a good few folk complain about the stress of adding yet another entry to their list of Twitter “follows”; their lists are often already huge through some sense of social obligation to reciprocate. I think it’s worth exploring some alternative filtering and grouping mechanisms.

On the one hand, we have people “bookmarking” people they find interesting, or want to stay in touch with, or “get to know better”. On the other, we have the bookmarked party sometimes reciprocating those actions because they feel it polite; a situation complicated by crude categories like “friend” versus “contact”. What makes this a particularly troublesome combination is when user-facing features, such as “updates from your buddies” or “photos from your friends/contacts” are built on top of these buddylists.

Take my own case on Flickr. I’m probably not typical, but I’m a use case I care about. I have made no real consistent use of the “friend” versus “contact” distinction; to even attempt to do so would be massively time consuming, and also a bit silly. There are all kinds of great people on my Flickr contacts list. Some I know really well, some I barely know but admire their photography, etc. It seems currently my Flickr account has 7 “family”, 79 “friends” and 604 “contacts”.

Now to be clear, I’m not grumbling about Flickr. Flickr is a work of genius, no nitpicking. What I ask for is something that goes beyond what Flickr alone can do. I’d like a better way of seeing updates from friends and contacts. This is just a specific case of a general thing (eg. RSS/Atom feed management), but let’s stay specific for now.

Currently, my Flickr email notification settings are:

  • When people comment on your photos: Yes
  • When your friends and family upload new photos: Yes (daily digest)
  • When your other contacts upload new photos: No

What this means is that I have selected to see notifications of photos from those 86 people whom I have flagged as “friend” or “family”. And I chose not to be notified of new photos from the other 604 contacts, even though that list contains many people I know, and would like to know better. The usability question here is: how can we offer more subtlety in this mechanism, without overwhelming users with detail? My guess is that we approach this by collecting, in a more personal environment, some information that users might not want to state publicly in an online profile. So a desktop (or weblog, e.g. SparqlPress) aggregation of information from addressbooks, email send/receive patterns, weblog commenting behaviours, machine readable resumes, … real evidence of real connections between real people.

And so I’ve been mining around for sources of “foaf:Group” data. As a work in progress testbed I have a list of OpenIDs of people whose comments I’ve approved in my blog. And another, of people who have worked on the FOAF wiki. I’ve been looking at the machine readable data from W3C’s Tech Reports digital archive too, as that provides information about a network of collaboration going back to the early days of the Web (and it’s available in RDF). I also want to integrate my sent-mail logs, to generate another group, and extract more still from my addressbook. On top of this of course I have access to the usual pile of FOAF and XFN, scraped or API-extracted information from social network sites and IM accounts. Rather than publish it all for the world to see, the idea here is to use it to generate simple personal-use user interfaces couched in terms of these groups. So, hopefully I shall be able to use those groups as a filter against the 600+ members of my flickr buddylist, and make some kind of basic RSS-filtered view of their photos.
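
As a concrete illustration of the kind of foaf:Group data I mean, here’s a minimal Ruby sketch that turns a list of approved-commenter OpenIDs into N-Triples for loading into a named graph. The input filename is hypothetical, and identifying members via foaf:openid on blank nodes is simply one modelling choice of mine:

require 'securerandom'

FOAF     = 'http://xmlns.com/foaf/0.1/'
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'

# Hypothetical input: one OpenID URL per line, e.g. approved blog commenters.
openids = File.readlines('approved-commenters.txt').map(&:strip).reject(&:empty?)

# Name the group itself with a fresh UUID, as with the crawl TOC graph earlier.
group = "<uuid:#{SecureRandom.uuid}>"

File.open('commenters-group.nt', 'w') do |out|
  out.puts "#{group} <#{RDF_TYPE}> <#{FOAF}Group> ."
  openids.each_with_index do |id, i|
    # Model each member as a blank node carrying a foaf:openid property.
    member = "_:m#{i}"
    out.puts "#{group} <#{FOAF}member> #{member} ."
    out.puts "#{member} <#{FOAF}openid> <#{id}> ."
  end
end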

If this succeeds, it may help people to have huge buddylists without feeling they’ve betrayed their “real friends” by doing so, or forcing them to put their friends into crudely labelled public buckets marked “real friend” and “mere contact”. The core FOAF design principle here is our longstanding bias towards evidence-based rather than proclaimed information. Don’t just ask people to tell you who their friends are, and how close the relationship is (especially in public, and most especially in a format that can be imported into spreadsheets for analysis). Instead… take the approach of “don’t say it, show it”. Create tools based on the evidence that friendship and collaboration leaves in the Web. The public Web, but also the bits that don’t show up in Google, and hopefully never will. If we can find a way to combine all that behind a simple UI, it might go a little way towards “waving not drowning” in information.

Much of this is stating the obvious, but I thought I’d leave the nerdly details of SPARQL and FOAF/RDF for another post.

Instead here’s a pointer to a Scoble article on Mr Facebook, where they’re grappling with similar issues but from a massive aggregation rather than decentralised perspective:

Facebook has a limitation on the number of friends a person can have, which is 4,999 friends. He says that was due to scaling/technical issues and they are working on getting rid of that limitation. Partly to make it possible to have celebrities on Facebook but also to make it possible to have more grouping kinds of features. He told me many members were getting thousands of friends and that those users are also asking for many more features to put their friends into different groups. He expects to see improvements in that area this year.

ps. anyone know if Facebook group membership data is accessible through their APIs? I know I can get group data from LiveJournal and from My Opera in FOAF, but would be cool to mix in Facebook too.

Begin again

[facebook grab]

There was an old man named Michael Finnegan
He went fishing with a pinnegan
Caught a fish and dropped it in again
Poor old Michael Finnegan
Begin again.

Let me clear something up. Danny mentions a discussion with Tim O’Reilly about SemWeb themes.

Much as I generally agree with Danny, I’m reaching for a ten-foot bargepole on this one point:

While Facebook may have achieved pretty major adoption for their approach, it’s only very marginally useful because of their overly simplistic treatment of relationships.

Facebook, despite the trivia, the endless wars between the ninja zombies and the pirate vampires; despite being centralised, despite [insert grumble] is massively useful. Proof of that pudding: it is massively used. “Marginal” doesn’t come into it. The real question is: what happens next?

Imagine 35 million people. Imagine them marching thru your front room. Jumping off a table at the same time. Sending you an email. Or turning the tap off when they brush their teeth. 35 million is a fair-sized nation. Taking that 35 million figure I’ve heard waved around, and placing it in the ever scientific Wikipedia listing … that puts the land of Facebook somewhere between Kenya and Algeria in the population charts. Perhaps the figures are exaggerated. Perhaps a few million have wandered off, or forgotten their passwords. Doubtless some only use it every month or few.

Even a million is a lot of use; and a lot of usefulness.

Don’t let anything I ever say here in this blog be taken as claiming such sites and services are only marginally useful. To be used is to be useful; and that’s something SemWeb people should keep in the forefront of their minds. And usually they do, I think, although the community tends towards the forward-looking.

But let’s be backwards-looking for a minute. My concern with these sites is not that they’re marginally useful, but that they could be even more useful. Slight difference of emphasis. SixDegrees.com was great, back in 2000 when we started FOAF. But it was a walled garden. It had cool graph traversal stuff that evocatively showed your connection path to anyone else in the network. Their network. Then followed Friendster, which got slow as it proved useful to too many people. Ditto Orkut, which everyone signed up to, then wandered off from when it proved there was rather little to do there except add people. MySpace and Facebook cracked that one, … but guess what, there’ll be more.

I got a signup to Yahoo’s Mash yesterday. Anyone wanna be my friend? It has fun stuff (“Mecca Ibrahim smacked The Mash Pet (your Mash pet)!”), … wiki-like profile editing, extension modules … and I’d hope given that this is 2007, eventually some form of API. People won’t live in Facebook-land forever. Nor in Mash, however fun it is. I still lean towards Jabber/XMPP as the long-term infrastructure for this sort of system, but that’s for another time. The appeal of SixDegrees, of Friendster, of Orkut … wasn’t ever the technology. It was the people. I was there ‘cos others were there. Nothing more. And I don’t see this changing, no matter how much the underlying technology evolves. And people move around, drift along to the next shiny thing, … go wherever their friends are. Which is our only real problem here.

Begin again.

I’ve been messing with RDF a bit. I made a sample SPARQL query that asks (exported RDF from) a few networks about my IM addresses; here are the results from Redland/Rasqal JSON.

Loosely joined

find . -name danbri-\*.rdf -exec rapper --count {} \;


rapper: Parsing file ./facebook/danbri-fb.rdf
rapper: Parsing returned 2155 statements
rapper: Parsing file ./orkut/danbri-orkut.rdf
rapper: Parsing returned 848 statements
rapper: Parsing file ./dopplr/danbri-dopplr.rdf
rapper: Parsing returned 346 statements
rapper: Parsing file ./tribe.net/danbri-tribe.rdf
rapper: Parsing returned 71 statements
rapper: Parsing file ./my.opera.com/danbri-opera.rdf
rapper: Parsing returned 123 statements
rapper: Parsing file ./advogato/danbri-advogato.rdf
rapper: Parsing returned 18 statements
rapper: Parsing file ./livejournal/danbri-livejournal.rdf
rapper: Parsing returned 139 statements

I can run little queries against various descriptions of me and my friends, extracted from places in the Web where we hang out.

Since we’re not yet in the shiny OpenID future, I’m matching people only on name (and setting up the myfb: etc prefixes to point to the relevant RDF files). I should probably take more care around xml:lang, to make sure things match. But this was just a rough test…


PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?n
FROM myfb:
FROM myorkut:
FROM dopplr:
WHERE {
  GRAPH myfb: { [ a :Person; :name ?n; :depiction ?img ] }
  GRAPH myorkut: { [ a :Person; :name ?n; :mbox_sha1sum ?hash ] }
  GRAPH dopplr: { [ a :Person; :name ?n; :img ?i2 ] }
}

…finds 12 names in common across Facebook, Orkut and Dopplr networks. Between Facebook and Orkut, 46 names. Facebook and Dopplr: 34. Dopplr and Orkut: 17 in common. Haven’t tried the others yet, nor generated RDF for IM and Flickr, which I probably have used more than any of these sites. The Facebook data was exported using the app I described recently; the Orkut data was done via the CSV format dumps they expose (non-mechanisable since they use a CAPTCHA), while the Dopplr list was generated with a few lines of Ruby and their draft API: I list as foaf:knows pairs of people who reciprocally share their travel plans. Tribe.net, LiveJournal, my.opera.com and Advogato expose RDF/FOAF directly. Re Orkut, I noticed that they now have the option to flood your GTalk Jabber/XMPP account roster with everyone you know on Orkut. Not sure of the wisdom of actually doing so (but I’ll try it), but it is worth noting that this quietly bridges a large ‘social networking’ site with an open standards-based toolset.
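
For what it’s worth, the reciprocity logic for the Dopplr case is tiny. Here’s a sketch with the API fetch stubbed out as a plain Hash (I’m not reproducing their draft API here, and the example.com URIs are obviously just placeholders):

# shares[a] = the people whose travel plans 'a' can see.
# In the real script this comes from Dopplr's draft API; here it's stubbed.
shares = {
  'danbri' => ['libby', 'mattb'],
  'libby'  => ['danbri'],
  'mattb'  => ['danbri', 'libby']
}

FOAF = 'http://xmlns.com/foaf/0.1/'

# Emit foaf:knows only where the sharing is reciprocal.
shares.each do |a, friends|
  friends.each do |b|
    next unless shares.fetch(b, []).include?(a)
    puts "<http://example.com/#{a}> <#{FOAF}knows> <http://example.com/#{b}> ."
  end
end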

For the record, the names common to my Dopplr, Facebook and Orkut accounts were: Liz Turner, Tom Heath, Rohit Khare, Edd Dumbill, Robin Berjon, Libby Miller, Brian Kelly, Matt Biddulph, Danny Ayers, Jeff Barr, Dave Beckett, Mark Baker. If I keep adding to the query for each other site, presumably the only person in common across all accounts will be …. me.

Flickr’d

Just renewed my Flickr-Pro account for 2 years, ensuring an irregular supply of pigeon, fish and other misc depictions.

I wasn’t 100% happy with the wording of their terms though.

To participate in Flickr pro, you must have a valid Yahoo! ID and, solely if you have not received a free offer or gift for a specific number of days of Flickr pro (“Free pro Period”), you will also need to provide other information, such as your credit card and billing information (your “Registration Data”). If you do not have a Yahoo! ID, you will be prompted to complete the registration process for it before you can register for Flickr pro. In consideration of your use of Flickr pro, you agree to: (a) provide true, accurate, current and complete information about yourself and (b) maintain and promptly update the Registration Data to keep it true, accurate, current and complete. If you provide any information that is untrue, inaccurate, not current or incomplete, or Flickr has reasonable grounds to suspect that such information is untrue, inaccurate, not current or incomplete, Flickr has the right to suspend or terminate your account and delete any information or content therein without liability to Flickr.

The “provide true, accurate, current and complete information about yourself” is only contextually limited to “credit card” and “billing information”; it could also plausibly be read as covering the more general Flickr user profile, on which I’ve every right to omit various bits of information (Missing isn’t broken). The billing system also let me have the choice of storing credit card info or re-entering it again next time it’s used. So it isn’t really clear what they’re asking for here. If my buddy icon doesn’t show enough grey hair, is that inaccurate? :) I guess they’re really focussed on contact details, in which case, it’s best to say so.

I signed up anyways. The Flickr API and the RDF-oriented Perl backup library make it a more reliable option for my photos than my own little Ruby scripts ever were. Back 2-3 years ago I maintained the fantasy that I’d manage my own photos and their metadata; the big reason I switched to Flickr was the commenting/social side. It’s just too hard for per-person sites to maintain that level of interactivity and community (unless you’re super famous or beautiful or both). And a photo site without comments and community, for me, is kind of boring. For decentralists … perhaps some combination of extended RSS feed plus OpenID for comments could come closer these days; but before OpenID, I couldn’t ever see a way for commenting and annotation to be massmarket-friendly in a decentralised manner. And, well, also no need to be grudging: Flickr is a great product. I’ve definitely had my money’s worth…