Outbox FOAF crawl: using the Google SG API to bring context to email

OK here’s a quick app I hacked up on my laptop largely in the back of a car, ie. took ~1 hour to get from idea to dataset.

Idea is the usual FOAFy schtick about taking an evidential rather than purely claim-based approach to the ‘social graph’.

We take the list of people I’ve emailed, and we probe the social data Web to scoop up user profile etc descriptions of them. Whether this is useful will depend upon how much information we can scoop up, of course. Eventually oauth or similar technology should come into play, for accessing non-public information. For now, we’re limited to data shared in the public Web.

Here’s what I did:

  1. I took the ‘sent mail’ folder on my laptop (a Mozilla Thunderbird installation).
  2. I built a list of all the email addresses I had sent mail to (in Cc: or To: fields). Yes, I grepped.
    • working assumption: these are humans I’m interacting with (ie. that I “foaf:know”)
    • I could also have looked for X-FOAF: headers in mail from them, but this is not widely deployed
    • I threw away anything that matched some company-related mail hosts I wanted to exclude.
  3. For each email address, scrubbed of noise, I prepend mailto: and generate a SHA1 value.
  4. The current Google SG API can be consulted about hashed mailboxes using URIs in the unregistered sgn: space, for example my hashed gmail address (danbrickley@gmail.com) gives sgn://mboxsha1/?pk=6e80d02de4cb3376605a34976e31188bb16180d0 … which can be dropped into the Google parameter playground. I’m talking with Brad Fitzpatrick about how best these URIs might be represented without inventing a new URI scheme, btw.
  5. Google returns a pile of information, indicating connections between URLs, mailboxes, hashes etc., whether verified or merely claimed.
  6. I crudely ignore all that, and simply parse out anything beginning with http:// to give me some starting points to crawl.
  7. I feed these to an RDF FOAF/XFN crawler attached to my SparqlPress-augmented WordPress blog.

With no optimisation, polish or testing, here (below) is the list of documents this process gave me. I have not yet investigated their contents.

The goal here is to find out more about the people who are filling my mailbox, so that better UI (eg. auto-sorting into folders, photos, task-clustering etc) could help make email more contextualised and sanity-preserving.  The next step I think is to come up with a grouping and permissioning model for this crawled FOAF/RDF in the WordPress SPARQL store, ie. a way of asking queries across all social graph data in the store that came from this particular workflow. My store has other semi-personal data, such as the results of crawling the groups of people who have successfuly commented in blogs and wikis that I run. I’m also looking at a Venn-diagram UI for presenting these different groups in a way that allows other groups (eg. “anyone I send mail to OR who commented in my wiki”) to be defined in terms of these evidence-driven primitive sets. But that’s another story.

FOAF files located with the help of the Google API:






























For a quick hack, I’m really pleased with this. The sent-mail folder I used isn’t even my only one. A rewrite of the script over IMAP (and proprietary mail host APIs) could be quite fun.