Outbox FOAF crawl: using the Google SG API to bring context to email

OK here’s a quick app I hacked up on my laptop, largely in the back of a car; ie. it took about an hour to get from idea to dataset.

The idea is the usual FOAFy schtick of taking an evidential rather than purely claim-based approach to the ‘social graph’.

We take the list of people I’ve emailed, and we probe the social data Web to scoop up user profiles and other descriptions of them. Whether this is useful will depend upon how much information we can find, of course. Eventually OAuth or similar technology should come into play, for accessing non-public information. For now, we’re limited to data shared in the public Web.

Here’s what I did (a rough sketch in code follows the list):

  1. I took the ‘sent mail’ folder on my laptop (a Mozilla Thunderbird installation).
  2. I built a list of all the email addresses I had sent mail to (in Cc: or To: fields). Yes, I grepped.
    • working assumption: these are humans I’m interacting with (ie. that I “foaf:know”)
    • I could also have looked for X-FOAF: headers in mail from them, but this is not widely deployed
    • I threw away anything that matched some company-related mail hosts I wanted to exclude.
  3. For each email address, scrubbed of noise, I prepended mailto: and generated a SHA1 hash of the result.
  4. The current Google SG API can be consulted about hashed mailboxes using URIs in the unregistered sgn: space, for example my hashed gmail address (danbrickley@gmail.com) gives sgn://mboxsha1/?pk=6e80d02de4cb3376605a34976e31188bb16180d0 … which can be dropped into the Google parameter playground. I’m talking with Brad Fitzpatrick about how best these URIs might be represented without inventing a new URI scheme, btw.
  5. Google returns a pile of information, indicating connections between URLs, mailboxes, hashes etc., whether verified or merely claimed.
  6. I crudely ignored all that, and simply parsed out anything beginning with http:// to give me some starting points to crawl.
  7. I fed these to an RDF FOAF/XFN crawler attached to my SparqlPress-augmented WordPress blog.

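For the curious, here’s a rough sketch of the pipeline in Python. It is a reconstruction from the steps above rather than the script I actually ran; the excluded hosts and mbox path are placeholders, and the Social Graph API endpoint and parameters are my reading of the current API docs, so double-check them before reusing this.

```python
# Sketch only: a reconstruction of the workflow above, not the actual script.
import hashlib
import mailbox
import re
import urllib.parse
import urllib.request

EXCLUDED_HOSTS = ("@example-corp.com",)  # placeholder: company mail hosts to skip

def sent_addresses(mbox_path):
    """Step 2: collect To:/Cc: addresses from a Thunderbird sent-mail mbox."""
    addrs = set()
    for msg in mailbox.mbox(mbox_path):
        for field in ("To", "Cc"):
            headers = str(msg.get(field) or "")
            for addr in re.findall(r"[\w.+-]+@[\w.-]+\w", headers):
                addr = addr.lower()
                if not addr.endswith(EXCLUDED_HOSTS):
                    addrs.add(addr)
    return addrs

def mbox_sha1(addr):
    """Step 3: SHA1 of the mailto: URI, as in foaf:mbox_sha1sum."""
    return hashlib.sha1(("mailto:" + addr).encode("utf-8")).hexdigest()

def crawl_seeds(sha1):
    """Steps 4-6: ask the SG API about the hashed mailbox, keep anything http://."""
    q = urllib.parse.quote("sgn://mboxsha1/?pk=" + sha1, safe="")
    url = "http://socialgraph.apis.google.com/lookup?edo=1&fme=1&q=" + q
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode("utf-8", "replace")
    return set(re.findall(r'http://[^\s"\\]+', body))

if __name__ == "__main__":
    for addr in sorted(sent_addresses("Mail/Local Folders/Sent")):  # placeholder path
        for seed in sorted(crawl_seeds(mbox_sha1(addr))):
            print(seed)  # step 7: hand these to the FOAF/XFN crawler
```
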
With no optimisation, polish or testing, here (below) is the list of documents this process gave me. I have not yet investigated their contents.

The goal here is to find out more about the people who are filling my mailbox, so that better UI (eg. auto-sorting into folders, photos, task-clustering etc) could help make email more contextualised and sanity-preserving. The next step, I think, is to come up with a grouping and permissioning model for this crawled FOAF/RDF in the WordPress SPARQL store, ie. a way of asking queries across all the social graph data in the store that came from this particular workflow. My store has other semi-personal data too, such as the results of crawling the groups of people who have successfully commented in blogs and wikis that I run. I’m also looking at a Venn-diagram UI for presenting these different groups in a way that allows other groups (eg. “anyone I send mail to OR who commented in my wiki”) to be defined in terms of these evidence-driven primitive sets. But that’s another story.
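
Still, to give a flavour: a union query for “anyone I send mail to OR who commented in my wiki” might look roughly like the sketch below, assuming each evidence source ends up in its own named graph. The endpoint URL and graph names are made-up placeholders, not how SparqlPress actually lays things out, and I’m using the SPARQLWrapper Python library purely for brevity.

```python
# Illustrative only: union of two evidence-driven groups in the SPARQL store.
# The endpoint and graph URIs below are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?person ?name WHERE {
  {
    GRAPH <urn:x-local:group:sent-mail-foaf-crawl> { ?person a foaf:Person . }
  } UNION {
    GRAPH <urn:x-local:group:wiki-commenters> { ?person a foaf:Person . }
  }
  OPTIONAL { ?person foaf:name ?name }
}
"""

store = SPARQLWrapper("http://example.org/wordpress/sparql")  # placeholder endpoint
store.setQuery(QUERY)
store.setReturnFormat(JSON)
for row in store.query().convert()["results"]["bindings"]:
    print(row["person"]["value"], row.get("name", {}).get("value", ""))
```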

FOAF files located with the help of the Google API:

http://www.advogato.org/person/benadida/foaf.rdf

http://www.w3c.es/Personal/Martin/foaf.rdf

http://www.iandickinson.me.uk/rdf/foaf.rdf

http://people.tribe.net/xe0s242

http://www.geocities.com/rmarkwhite/foaf.xml

http://www.l3s.de/~diederich/foaf.rdf

http://www.w3.org/People/maxf/foaf

http://revyu.com/people/mhausenblas/about/rdf

http://www.cs.vu.nl/~laroyo/foaf.rdf

http://wwwis.win.tue.nl/~nstash/foaf.rdf

http://www.nimbustier.net/nimbustier/foaf.rdf

http://www.cubicgarden.com/webdav/profile/foaf.rdf

http://www.ibiblio.org/hhalpin/foaf.rdf

http://www.kjetil.kjernsmo.net/linkedin-contacts.rdf

http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user=dino

http://www.kjetil.kjernsmo.net/friends.rdf

http://www.wikier.org/foaf.rdf

http://www.tribe.net/FOAF/4916e791-9793-40c9-875b-21cb25733a32

http://www.zonk.net/ghard/foaf.rdf

http://www.zonk.net/ghard/foaf.rdf

http://www.kjetil.kjernsmo.net/friends.rdf

http://www.kjetil.kjernsmo.net/linkedin-contacts.rdf

http://www.gnowsis.com/leo/foaf.xml

http://people.tribe.net/c09622fd-c817-4935-a039-5cb5d0d6bed8

http://www.few.vu.nl/~wrvhage/foaf.rdf

http://www.zonk.net/ghard/foaf.rdf

http://www.lri.fr/~pietriga/foaf.rdf

http://www.advogato.org/person/Pike/foaf.rdf

http://www.deri.ie/fileadmin/scripts/foaf.php?id=12

For a quick hack, I’m really pleased with this. The sent-mail folder I used isn’t even my only one. A rewrite of the script over IMAP (and proprietary mail host APIs) could be quite fun.
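
If I do get around to that, the IMAP end of it is pleasantly simple; something like the sketch below (host, credentials and folder name are placeholders, and this only covers the address-harvesting step):

```python
import imaplib
import re

def sent_addresses_imap(host, user, password, folder="Sent"):
    """Pull To:/Cc: addresses straight from a sent-mail folder over IMAP."""
    addrs = set()
    imap = imaplib.IMAP4_SSL(host)
    imap.login(user, password)
    imap.select(folder, readonly=True)
    _, data = imap.search(None, "ALL")
    for num in data[0].split():
        _, parts = imap.fetch(num, "(BODY.PEEK[HEADER.FIELDS (TO CC)])")
        headers = parts[0][1].decode("utf-8", "replace")
        addrs.update(a.lower() for a in re.findall(r"[\w.+-]+@[\w.-]+\w", headers))
    imap.logout()
    return addrs
```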

Google Social Graph API, privacy and the public record

I’m digesting some of the reactions to Google’s recently announced Social Graph API. ReadWriteWeb ask whether this is a creeping privacy violation, and danah boyd has a thoughtful post raising concerns about whether the privileged tech elite have any right to experiment in this way with the online lives of those who lack status and knowledge of these obscure technologies, and who may be amongst the more vulnerable users of the social Web.

While I tend to agree with Tim O’Reilly that privacy by obscurity is dead, I’m not of the “privacy is dead, get over it” school of thought. Tim argues,

The counter-argument is that all this data is available anyway, and that by making it more visible, we raise people’s awareness and ultimately their behavior. I’m in the latter camp. It’s a lot like the evolutionary value of pain. Search creates feedback loops that allow us to learn from and modify our behavior. A false sense of security helps bad actors more than tools that make information more visible.

There’s a danger here of technologists seeming to blame the very people we’re causing pain to. As danah says, “Think about whistle blowers, women or queer folk in repressive societies, journalists, etc.”. Not everyone knows their DTD from their TCP, or understands anything of how search engines, HTML or hyperlinks work. And many folk have more urgent things to focus on than learning such obscurities, let alone understanding the practical privacy, safety and reputation-related implications of their technology-mediated deeds.

Web technologists have responsibilities to the users of the Web, and while media education and literacy are important, those who are shaping and re-shaping the Web ought to be spending serious time on a daily basis struggling to come up with better ways of allowing humans to act and interact online without other parties snooping. The end of privacy by obscurity should not mean the death of privacy.

Privacy is not dead, and we will not get over it.

But it does need to be understood in the context of the public record. The reason I am enthusiastic about the Google work is that it shines a big bright light on the things people are currently putting into the public record. And it does so in a way that should allow people to build better online environments for those who do want their public actions visible, while providing immediate – and sometimes painful – feedback to those who have over-exposed themselves in the Web, and wish to backpedal.

I hope Google can put a user support mechanism on this. I know from our experience in the FOAF community that, even with small-scale and obscure aggregators, people will find themselves and demand to be “taken down”. While any particular aggregator can remove or hide such data, unless the data is tracked back to its source it’ll crop up elsewhere in the Web.

I think the argument that FOAF and XFN are particularly special here is a big mistake. Web technologies used correctly (POSH – “plain old semantic HTML” in microformats-speak) already facilitate such techniques. And Google is far from the only search engine in existence. Short of obfuscating all text inside images, personal data from these sites is readily harvestable.

ReadWriteWeb comment:

None the less, apparently the absence of XFN/FOAF data in your social network is no assurance that it won’t be pulled into the new Google API, either. The Google API page says “we currently index the public Web for XHTML Friends Network (XFN), Friend of a Friend (FOAF) markup and other publicly declared connections.” In other words, it’s not opt-in by even publishers – they aren’t required to make their information available in marked-up code.

The Web itself is built from marked-up code, and this is a thing of huge benefit to humanity. Both microformats and the Semantic Web community share the perspective that the Web’s core technologies (HTML, XHTML, XML, URIs) are properly consumed both by machines and by humans, and that any effort to create documents that are usable only by (certain fortunate) humans is anti-social and discriminatory.

The Web Accessibility movement have worked incredibly hard over many years to encourage Web designers to create well marked up pages, where the meaning of the content is as mechanically evident as possible. The more evident the meaning of a document, the easier it is to repurpose it or present it through alternate means. This goal of device-independent, well marked up Web content is one that unites the accessibility, Mobile Web, Web 2.0, microformat and Semantic Web efforts. Perhaps the most obvious case is for blind and partially sighted users, but good markup can also benefit those who are unable to use a mouse or keyboard. Beyond accessibility, many millions of Web users (many poor, and in poor countries) will have access to the Web only via mobile phones. My former employer W3C has just published a draft document, “Experiences Shared by People with Disabilities and by People Using Mobile Devices”. Last month in Bangalore, W3C held a Workshop on the Mobile Web in Developing Countries (see executive summary).

I read both Tim’s post, and danah’s post, and I agree with large parts of what they’re both saying. But not quite with either of them, so all I can think to do is spell out some of my perhaps previously unarticulated assumptions.

  • There is no huge difference in principle between “normal” HTML Web pages and XFN or FOAF. Textual markup is what the Web is built from.
  • FOAF and XFN take some of the guesswork out of interpreting markup. But other technologies (JavaScript, Perl, XSLT/GRDDL) can also transform vague markup into more machine-friendly markup. FOAF/XFN simply make this process easier, less heuristic and less error-prone.
  • Google was not the first search engine, it is not the only search engine, and it will not be the last search engine. To obsess on Google’s behaviour here is to mistake Google for the Web.
  • Deeds that are on the public record in the Web may come to light months or years later; Google’s opening up of the (already public, but fragmented) Usenet historical record is a good example here.
  • Arguing against good markup practice on the Web (accessible, device independent markup) is something that may hurt underprivileged users (with disabilities, or limited access via mobile, high bandwidth costs etc).
  • Good markup allows content to be automatically summarised and re-presented to suit a variety of means of interaction and navigation (eg. voice browsers, screen readers, small screens, non-mouse navigation etc).
  • Good markup also makes it possible for search engines, crawlers and aggregators to offer richer services.

The difference between Google crawling FOAF/XFN from LiveJournal, versus extracting similar information via custom scripts from MySpace, is interesting and important solely to geeks. Mainstream users have no idea of such distinctions. When LiveJournal originally launched their FOAF files in 2004, the rule they followed was a pretty sensible one: if the information was there in the HTML pages, they’d also expose it in FOAF.

We need to be careful of taking a ruthless “you can’t make an omelette without breaking eggs” line here. Whatever we do, people will suffer. If the Web is made inaccessible, with information hidden inside image files or otherwise obfuscated, we exclude a huge constituency of users. If we shine a light on the public record, as Google have done, we’ll embarrass, expose and even potentially risk harm to the people described by these interlinked documents. And if we stick our heads in the sand and pretend that these folk aren’t exposed, I predict this will come back to bite us in the butt in a few months or years, since all that data is out there, being crawled, indexed and analysed by parties other than Google. Parties with less to lose, and more to gain.

So what to do? I think several activities need to happen in parallel:

  • Best practice codes for those who expose, and those who aggregate, social Web data
  • Improved media literacy education for those who are unwittingly exposing too much of themselves online
  • Technology development around decentralised, non-public record communication and community tools (eg. via Jabber/XMPP)

Any search engine at all, today, is capable of supporting the following bit of mischief:

Take as a starting point a collection of user profiles on a public site. Extract all the usernames. Find the ones that appear in the Web fewer than, say, 10,000 times, and that also appear on other sites. Assume these are unique user IDs and crawl the pages they appear in, do some heuristic name matching, … and you’ll have a pile of smushed identities, perhaps linking professional and dating sites, or drunken college photos to a respectable new life. No FOAF needed.
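
Here’s that mischief as a sketch, with the search-engine calls deliberately stubbed out; the username-extraction pattern and the two helper functions are placeholders for whichever site and search API you point this at.

```python
# Sketch of heuristic identity smushing; no FOAF involved.
import re
import urllib.request

RARE_THRESHOLD = 10_000  # "appears in the Web fewer than, say, 10,000 times"

def hit_count(term):
    """Placeholder: how many pages does some search engine report for this term?"""
    raise NotImplementedError("plug in any search API here")

def pages_mentioning(term):
    """Placeholder: URLs of pages, on any site, mentioning the term."""
    raise NotImplementedError("plug in any search API here")

def smush(profile_urls, username_pattern=r"/(?:user|people|profile)/([\w.-]+)"):
    """Heuristically link accounts that share a rare username across sites."""
    identities = {}
    for url in profile_urls:
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        for name in set(re.findall(username_pattern, html)):
            if hit_count(name) < RARE_THRESHOLD:
                # assume a rare username is the same person wherever it appears
                identities.setdefault(name, set()).update(pages_mentioning(name))
    return identities
```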

The answer I think isn’t to beat up on the aggregators, it’s to improve the Web experience such that people can have real privacy when they need it, rather than the misleading illusion of privacy. This isn’t going to be easy, but I don’t see a credible alternative.

A tale of two business models

Glancing back at 1998, thanks to the Wayback Machine.

W3C has Royalty Free licensing requirements and a public Process Document for good reason. I’ve been clicking back through the paper trails around the W3C P3P vs Intermind patent issue as a reminder. Here is the “Appendix C: W3C Licensing Plan Summary” from the old Intermind site:

We expect to license the patents to practice the P3P standard as it evolves over time as follows:

User Agents: For User Agent functionality, all commercial licensees will pay a royalty of 1% of revenues directly associated with the use of P3P or 0.1% of all revenues directly associated with the product employing the User Agent at the licensee’s option. We expect to measure revenues in a confidential and mutually acceptable manner with the licensee.

Service Providers: For Service Provider functionality, all commercial licensees will pay a royalty of 1% of revenues directly associated with the use of P3P. We expect to measure these revenues through the use of Web site logs, which will determine the percentage of P3P-related traffic and apply that percentage to the relevant Web site revenues (i.e., advertising-based revenues or transaction-based revenues). We expect to determine a method for monitoring or auditing such logs in a confidential and mutually acceptable manner with the licensee.

[...]

Intermind Corporation also expects to license the patents for CDF, ICE, and other XML-based agent technology on non-discriminatory terms. Members interested in further information on the relationship of Intermind’s patents to these technologies can contact Drummond Reed at drummond@intermind.com or 206-812-6000.

Nearby in the Web:

Cover Pages on Extensible Name Service (XNS):

Background: “In January 1999, the first of Intermind’s web agent patents began being issued (starting with U.S. patent No. 5,862,325). At the heart of this patent was a new naming and addressing service based on web agent technology. With the emergence of XML as a new global data interchange language — one perfectly suited to the requirements of a global ‘language’ for web agents — Intermind changed its name to OneName Corporation, built a new board and management team, and embarked on the development of this new global naming and addressing service. Because its use of XML as the foundation for all object representation and interchange led to the platform, yet had the same distributed architecture as DNS, it was christened eXtensible Name Service, or XNS. Recognizing the ultimate impact such a system may have on Internet infrastructure, and the crucial role that privacy, security, and trust must play, OneName also made the commitment to building it with open standards, open source software, and an open independent governance organization. Thus was born the XNS Public Trust Organization (XNSORG), the entity charged with setting the technical, operational, and legal standards for XNS.”

Over on XDI.ORG – Licenses and Agreements:

Summary of the XDI.ORG Intellectual Property Rights Agreement

NOTE: This summary is provided as a convenience to readers and is not intended in any way to modify or substitute for the full text of the agreement.

The purpose of the XDI.ORG IPR Agreement between XDI.ORG and OneName Corporation (dba Cordance) is to facilitate and promote the widespread adoption of XDI infrastructure by transfering the intellectual property rights underlying the XDI technology to a community-governed public trust organization.

The agreement grants XDI.ORG exclusive, worldwide, royalty-free license to a body of patents, trademarks, copyrights, and specifications developed by Cordance on the database linking technology underlying XDI. In turn, it requires that XDI.ORG manage these intellectual property rights in the public interest and make them freely available to the Internet community as royalty-free open standards. (It specifically adopts the definition provided by Bruce Perens which includes the ability for XDI.ORG to protect against “embrace and enhance” strategies.)

There is also a Global Service Provider aspect to this neutral, royalty-free standard. Another excerpt:

Summary of the XDI.ORG Global Service Provider Agreement

NOTE: This summary is provided as a convenience to readers and is not intended in any way to modify or substitute for the full text of the agreement.

Global Services are those XDI services offered by Global Service Providers (GSPs) based on the XRI Global Context Symbols (=, @, +, !) to facilitate interoperability of XDI data interchange among all users/members of the XDI community. XDI.ORG governs the provision of Global Services and has the authority to contract with GSPs to provide them to the XDI community.

For each Global Service, XDI.ORG may contract with a Primary GSP (similar to the operator of a primary nameserver in DNS) and any number of Secondary GSPs (similar to the operator of a secondary DNS nameserver). The Secondary GSPs mirror the Primary for loadbalancing and failover. Together, the Primary and Secondary GSPs operate the infrastructure for each Global Service according to the Global Services Specifications published and maintained by XDI.ORG.

The initial XDI.ORG GSP Agreement is between XDI.ORG and OneName Corporation (dba Cordance). The agreement specifies the rights and obligations of both XDI.ORG and Cordance with regard to developing and operating the first set of Global Services. For each of these services, the overall process is as follows:

  • If Cordance wishes to serve as the Primary GSP for a service, it must develop and contribute an initial Global Service Specification to XDI.ORG.
  • XDI.ORG will then hold a public review of the Global Service Specification and amend it as necessary.
  • Once XDI.ORG approves the Global Service Specification, Cordance must implement it in a commercially reasonable period. If Cordance is not able to implement or operate the service as required by the Global Service Specification, XDI.ORG may contract with another party to be the primary GSP.
  • XDI.ORG may contract with any number of Secondary GSPs.
  • If XDI.ORG desires to commence a new Global Service and Cordance does not elect to develop the Global Service Specification or provide the service, XDI.ORG is free to contract with another party.

The contract has a fifteen year term and covers a specified set of Global Services. Those services are divided into two classes: cost-based and fee-based. Cost-based services will be supplied by Cordance at cost plus 10%. Fee-based services will be supplied by Cordance at annual fees not to exceed maximums specified in the agreement. These fees are the wholesale cost to XDI.ORG; XDI.ORG will then add fees from any other GSPs supplying the service plus its own overhead fee to determine the wholesale price to registrars (registrars then set their own retail prices just as with DNS). Cordance’s wholesale fees are based on a sliding scale by volume and range from U.S. $5.40 down to $3.40 per year for global personal i-names and from U.S. $22.00 down to $13.50 per year for global organizational i-names.

The agreement also ensures all registrants of the original XNS Personal Name Service and Organizational Name Service have the right to convert their original XNS registration into a new XDI.ORG global i-name registration at no charge. This conversion period must last for at least 90 days after the commencement of the new global i-name service.

Over on inames.net, Become an i-broker:

What Is an I-Broker?

From Wikipedia:

“I-brokers are ‘bankers for data’ or ‘ISPs for identity services’–trusted third parties that help people and organizations share private data the same way banks help us exchange funds and ISPs help us exchange email and files.”

I-brokers are the core providers of XRI digital identity infrastructure. They not only provide i-name and i-number registration services, but also they provide i-services: a new layer of digital identity services that help people and business safely interact on the Internet. See the I-Service Directory for the first open-standard i-services that can be offered by any XDI.org-accredited i-broker.

How Does an I-Broker Become XDI.org-Accredited?

Cordance and NeuStar, together with XDI.org, have published a short guide to the process, “Becoming an I-Broker,” which includes all the information necessary to get started. It also includes contact information for the i-broker support teams at both companies.

Download Becoming an I-Broker

In addition the following two zip files contain all the documents needed by an i-broker. The first one contains the i-broker application and agreements, which are appendicies F and I of the XDI.org Global Services Specifications (GSS), located in their complete form at http://gss.xdi.org. The second one contains all the rest of the GSS documents referenced by the application and agreement.

From that downloadable PDF,

What is the application fee?
The standard application fee is USD $2500. However between the GRS opening on June 20th and the start of Digital ID World 2006 on September 11, 2006, you may apply to Cordance to have the fee offset by development and marketing commitments. For more information, contact Cordance or NeuStar at the addresses below.

In the XDI.org GSS wiki,

6.1. Appendix B: Fee Schedules

Taking a look at that, we see:

Global Services Specification V1.0
Appendix B: Fee Schedule
Revised Version: 1 September, 2007

[...]

This and all contributions to XDI.org are open, public, royalty-free specifications licensed under the XDI.org license available at http://www.xdi.org/docref/legal/xdi-org-license.html.

…where I find I can buy a personal i-name for $5/year, or a business i-name for $29.44, as an individual. Or as “become an i-broker” points out,

For volume pricing, contact Cordance or NeuStar at the addresses below.