Bruce Schneier: Our Data, Ourselves

Via Libby; Bruce Schneier on data:

In the information age, we all have a data shadow.

We leave data everywhere we go. It’s not just our bank accounts and stock portfolios, or our itemized bills, listing every credit card purchase and telephone call we make. It’s automatic road-toll collection systems, supermarket affinity cards, ATMs and so on.

It’s also our lives. Our love letters and friendly chat. Our personal e-mails and SMS messages. Our business plans, strategies and offhand conversations. Our political leanings and positions. And this is just the data we interact with. We all have shadow selves living in the data banks of hundreds of corporations’ information brokers — information about us that is both surprisingly personal and uncannily complete — except for the errors that you can neither see nor correct.

What happens to our data happens to ourselves.

This shadow self doesn’t just sit there: It’s constantly touched. It’s examined and judged. When we apply for a bank loan, it’s our data that determines whether or not we get it. When we try to board an airplane, it’s our data that determines how thoroughly we get searched — or whether we get to board at all. If the government wants to investigate us, they’re more likely to go through our data than they are to search our homes; for a lot of that data, they don’t even need a warrant.

Who controls our data controls our lives. [...]

Increasingly, we’re going to be seeing this data flow through protocols like OAuth. SemWeb people should get their heads around how this is likely to work. It’s rather likely we’ll see SPARQL data stores with non-public personal data flowing through them; what worries me is that there’s not yet any data management discipline on top of this that’ll help us keep track of who is allowed to see what, and which graphs should be deleted or refreshed at which times.
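
To sketch what I mean (using an invented vocabulary, since nothing like this is standardised yet), a store might keep per-graph bookkeeping along these lines:

@prefix policy: <http://example.org/ns/graph-policy#> .  # invented for illustration
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.com/graphs/danbri-facebook>
    policy:sourceService <http://www.facebook.com/> ;
    policy:visibleTo <http://danbri.org/> ;        # who is allowed to see this graph
    policy:refreshAfter "P7D"^^xsd:duration ;      # re-fetch weekly
    policy:purgeAfter "P30D"^^xsd:duration .       # delete if not refreshed in a month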

I recently transcribed some notes from a Robert Scoble post about Facebook and data portability into the FOAF wiki. In it, Scoble reported some comments from Dave Morin of Facebook, regarding data flow. Excerpts:

For instance, what if a user wants to delete his or her info off of Facebook. Today that’s possible. But what about in a really data portable world? After all, in such a world Facebook might have sprayed your email and other data to other social networks. What if those other social networks don’t want to delete your data after you asked Facebook to?

Another case: you want your closest Facebook friends to know your birthday, but not everyone else. How do you make your social network data portable, but make sure that your privacy is secured?

Another case? Which of your data is yours? Which belongs to your friends? And, which belongs to the social network itself? For instance, we can say that my photos that I put on Facebook are mine and that they should also be shared with, say, Flickr or SmugMug, right? How about the comments under those photos? The tags? The privacy data that was entered about them? The voting data? And other stuff that other users might have put onto those photos? Is all of that stuff supposed to be portable? (I’d argue no, cause how would a comment left by a Facebook user on Facebook be good on Flickr?) So, if you argue no, where is the line? And, even if we can all agree on where the line is, how do we get both Facebook and Flickr to build the APIs needed to make that happen?

I’d like to see SPARQL stores that can police their data access behaviour, with clarity for each data graph in the store about the contexts in which that data can be re-exposed, and the schedule by which the data should be refreshed or purged. Making it easy for data to flow is only half the problem…
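
To continue the invented policy vocabulary sketched above, the gatekeeping piece might be no more than a check the store runs before answering a query or re-exposing a graph, eg.:

# Purely illustrative: may this graph be shown to this requester?
PREFIX policy: <http://example.org/ns/graph-policy#>
ASK {
  <http://example.com/graphs/danbri-facebook> policy:visibleTo <http://danbri.org/> .
}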

Opening and closing like flowers (social platform roundupathon)

Closing some tabs…

Stephen Fry writing on ‘social network’ sites back in January (also in the Guardian):

…what an irony! For what is this much-trumpeted social networking but an escape back into that world of the closed online service of 15 or 20 years ago? Is it part of some deep human instinct that we take an organism as open and wild and free as the internet, and wish then to divide it into citadels, into closed-border republics and independent city states? The systole and diastole of history has us opening and closing like a flower: escaping our fortresses and enclosures into the open fields, and then building hedges, villages and cities in which to imprison ourselves again before repeating the process once more. The internet seems to be following this pattern.

How does this help us predict the Next Big Thing? That’s what everyone wants to know, if only because they want to make heaps of money from it. In 1999 Douglas Adams said: “Computer people are the last to guess what’s coming next. I mean, come on, they’re so astonished by the fact that the year 1999 is going to be followed by the year 2000 that it’s costing us billions to prepare for it.”

But let the rise of social networking alert you to the possibility that, even in the futuristic world of the net, the next big thing might just be a return to a made-over old thing.

McSweeney’s:

Dear Mr. Zuckerberg,

After checking many of the profiles on your website, I feel it is my duty to inform you that there are some serious errors present. [...]

Lest we forget: the AOL search log privacy goofup from 2006:

No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men” to “dog that urinates on everything.”

And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.”

It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs. “Those are my searches,” she said, after a reporter read part of the list to her.

Time magazine punditising on iGoogle, Facebook and OpenSocial:

Google, which makes its money on a free and open Web, was not happy with the Facebook platform. That’s because what happens on Facebook stays on Facebook. Google would much prefer that you come out and play on its platform — the wide-open Web. Don’t stay behind Facebook’s closed doors! Hie thee to the Web and start searching for things. That’s how Google makes its money.

So, last fall, Google rallied all the other major social networks (MySpace, Bebo, Hi5 and so on) and announced a new initiative called OpenSocial. OpenSocial wants to be like Facebook’s platform, only much bigger: Widget makers can write applications for it and they can run anywhere — on MySpace, Bebo and Google’s own social network, Orkut, which is very big in Brazil.

Google’s platform could actually dwarf Facebook — if it ever gets off the ground.

Meanwhile on the widget and webapp security front, we have “BBC exposes Facebook flaw” (information about your buddies is accessible to apps you install; information about you is accessible to apps they install). Also see Thomas Roessler’s comments to my Nokiana post for links to a couple of great presentations he made on widget security. This includes a big oopsie with the Google Mail widget for MacOSX. Over in Ars Technica we learn that KDE 4.1 alpha 1 now has improved widget powers, including “preliminary support for SuperKaramba and Mac OS X Dashboard widgets”. Wonder if I can read my Gmail there…

As Stephen Fry says, these things are “opening and closing like a flower”. The big hosted social sites have a certain oversimplifying crudeness about them. But the ability for code to go and visit data (the widget/gadget model) is, I think, as valid as the open-data model, where data flows around to visit code. I am optimistic that good things will come out of this ferment.

A few weeks ago I had the pleasure of meeting several of the Google OpenSocial crew in London. They took my grumbling about accessibility issues pretty well, and I hope to continue that conversation. Industry politics and punditry aside, I’m impressed with their professionalism and with the tie-in to an opensource implementation through Apache’s Shindig project. The OpenSocial specs list is open to the public, where Cassie has just announced that “all 0.8 opensocial and gadgets spec changes have been resolved” (after a heroic slog through the issue list). I’m barely tracking the detail of discussion there; things are moving fast. There’s now a proposed REST API, for example; and I learned in London about plans for a formatting/templating system, which might be one mechanism for getting FOAF/RDF out of OpenSocial containers.

If OpenSocial continues to grow and gather opensource mindshare, it’s possible Facebook will throw some chunks of their platform over the wall (ie. “do an Adobe”). And it’ll probably be left to W3C to clean up the ensuing mess and fragmentation, but I guess that’s what they’re there for. Meanwhile there’s plenty yet to be figured out, … I think we’re in a pre-standards experimentation phase, regardless of how stable or mature we’re told these platforms are.

The fundamental tension here is that we want open data and open platforms, … for data and code to flow freely, but we also want to protect the privacy, lives and blushes of those the data describes. A tricky balance. Don’t let anyone tell you it’s easy, that we’ve got it figured out, or that all we need to do is “tear down the walls”.

Opening and closing like flowers…

Imagemap magic

I’ve always found HTML imagemaps to be a curiously neglected technology. They seem somehow to evoke the Web of the mid-to-late 90s, to be terribly ‘1.0’. But there’s glue in the old horse yet…

A client-side HTML imagemap lets you associate links (and, via Javascript, behaviour) with regions of an image. As such, imagemaps are a form of image metadata with applications including image search, Web accessibility and social networking. They’re also a poor cousin to the Web’s new vector image format, SVG. This morning I dug out some old work on this (much of it from Max, Libby and Jim, all of whom btw are currently working at Joost, as am I, albeit part-time).

The first hurdle you hit when you want to play with HTML imagemaps is finding an editor that produces them. The fact that my blog post asking for MacOSX HTML imagemap editors is now the top Google hit for “MacOSX HTML imagemap” pretty much says it all. Eventually I found (and paid for) one called YokMak that seems OK.

So the first experiment here was to take a picture (of me) and make a simple HTML imagemap.

danbri being imagemapped
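
For anyone who hasn’t written one since those mid-90s days, here is roughly what the markup looks like (coordinates and link targets invented for illustration):

<img src="danbri.jpg" alt="danbri" usemap="#danbri-map" />
<map name="danbri-map">
  <!-- each area ties a region of the image to a link target -->
  <area shape="rect" coords="40,30,120,110" href="http://danbri.org/" alt="danbri's face" />
  <area shape="circle" coords="200,80,25" href="http://example.com/hat" alt="a hat" />
</map>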

As a step towards treating this as re-usable metadata, here’s imagemap2svg.xslt from Max back in 2002. The results of running it with xsltproc are online: _output.svg (you need an SVG-happy browser). Firefox, Safari and Opera seem more or less happy with it (ie. they show the selected area against a pink background). This shows that imagemap data can be freed from the clutches of HTML, and repurposed. You can do similar things server-side using Apache Batik, a Java SVG toolkit. There are still a few 2002 examples floating around, showing how bits of the image can be described in RDF that includes imagemap info, and then manipulated using SVG tools driven from metadata.
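
The essence of the transform is just coordinate shuffling: each imagemap area becomes an SVG shape. Schematically (a simplification, not the literal _output.svg):

<svg xmlns="http://www.w3.org/2000/svg" width="300" height="240">
  <!-- pink backdrop, with the mapped region outlined on top -->
  <rect width="100%" height="100%" fill="pink" />
  <rect x="40" y="30" width="80" height="80" fill="none" stroke="black" />
</svg>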

Once we have this ability to pick out a region of an image (eg. photo) and tag it, it opens up a few fun directions. In the FOAF scene a few years ago, we had fun using RDF to tag image region parts with information about the things they depicted. But we didn’t really get into questions of surface-syntax, ie. how to make rich claims about the image area directly within the HTML markup. These days, some combination of RDFa or microformats would probably be the thing to use (or perhaps GRDDL). I’ve sent mail to the RDFa group looking for help with this (see that message for various further related-work links too).

Specifically, I’d love to have some clean HTML markup that said, not just “this area of the photo is associated with the URI http://danbri.org/”, but “this area is the Person whose openid is danbri.org, … and this area depicts the thing that is the primary topic of http://en.wikipedia.org/wiki/Eiffel_Tower”. If we had this, I think we’d have some nice tools for finding images, for explaining images to people who can’t see them, and for connecting people and social networks through codepiction.
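
As a straw man (this is precisely the design question I’m asking the RDFa folk about, so don’t treat it as settled or even valid RDFa), such markup might look like:

<map name="photo-regions" xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <!-- straw man: this region depicts the person whose openid is danbri.org -->
  <area shape="rect" coords="40,30,120,110" rel="foaf:depicts"
        href="http://danbri.org/" alt="danbri" />
  <!-- straw man: this region depicts the Eiffel Tower; strictly we want the
       primary topic of the Wikipedia page, not the page itself -->
  <area shape="rect" coords="150,10,220,200" rel="foaf:depicts"
        href="http://en.wikipedia.org/wiki/Eiffel_Tower" alt="Eiffel Tower" />
</map>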

Codepiction

Querying Facebook in SPARQL

A fair few people have been asking about FOAF exporters from Facebook. I’m not entirely sure what else is out there, but Matthew Rowe has just announced a Facebook FOAF generator. It doesn’t dump all 35 million records into your Web browser, thankfully. But it will export a minimal description of you and your Facebook associates. At the moment, you get name, a photo URL, and (in this revision of the tool) a Facebook account name using FOAF’s OnlineAccount construct.

As an aside, this part of the FOAF design provides a way for identifiers from arbitrary services to be described in FOAF without special-purpose support. Some services have shortcut property names, eg. msnChatID, and we may add more, but it is also important to allow this kind of freeform, decentralised identification. People shouldn’t have to petition the FOAF spec editors before any given Social Network site’s IDs can be supported; they can always use their own vocabulary alongside FOAF, or use the OnlineAccount construct as shown here.
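
In Turtle, the shape of such a description is roughly as follows (a hand-written approximation using my own entry, not the exporter’s literal output):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

[] a foaf:Person ;
   foaf:name "Dan Brickley" ;
   foaf:holdsAccount [
       a foaf:OnlineAccount ;
       foaf:accountServiceHomepage <http://www.facebook.com/> ;
       foaf:accountName "624168"
   ] .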

I’ve saved my Facebook export on my Web site, working on the assumption that Facebook IDs are not private data. If people think otherwise, let me know and I’ll change the setup. We might also discuss whether even sharing the names and connectivity graph will upset people’s privacy expectations, but that’s for another day. Let me know if you’re annoyed!

Here is a quick SPARQL query, which simply asks for details of each person mentioned in the file who has an account on Facebook.


PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name ?pic ?id
WHERE {
  [ a :Person ;
    :name ?name ;
    :depiction ?pic ;
    :holdsAccount [ :accountServiceHomepage <http://www.facebook.com/> ;
                    :accountName ?id ]
  ]
}
ORDER BY ?name

I tested this online using Dave Beckett’s Rasqal-based Web service. It should return a big list of the first 200 people matched by the query, ordered alphabetically by name.

For “Web 2.0” fans, SPARQL’s result sets are essentially tabular (just like SQL), and have encodings in both simple XML and JSON. So whatever you might have heard about RDF’s syntactic complexity, you can forget it when dealing with a SPARQL engine.

Here’s a fragment of the JSON results from the above query:


{
  "name" : { "type": "literal", "value": "Dan Brickley" },
  "pic" : { "type": "uri", "value": "http://danbri.org/yasns/facebook/danbri-fb.rdf" },
  "id" : { "type": "literal", "value": "624168" }
},
{
  "name" : { "type": "literal", "value": "Dan Brickley" },
  "pic" : { "type": "uri", "value": "http://profile.ak.facebook.com/profile5/575/66/s501730978_7421.jpg" },
  "id" : { "type": "literal", "value": "501730978" }
}, ...

What’s going on here? (a) Why are there two of me? (b) And why does it think that one of us has my Facebook FOAF file’s URL as a mugshot picture?

There’s no big mystery here. Firstly, there’s another guy who has the cheek to be called Dan Brickley. We’re friends on Facebook, even though we should probably be mortal enemies or something. Secondly, why does it give him the wrong URL for his photo? This is also straightforward, if a little technical. Basically, it’s an easily-fixed bug in this version of the FOAF exporter I used. When an image URL is not available, the converter still generates markup like “<foaf:depiction rdf:resource=""/>”. This empty URL is treated in RDF as the extreme case of a relative link, ie. the same kind of thing as writing “../../images/me.jpg” in a normal Web page. And since RDF is all about de-contextualising information, your RDF parser will try to resolve the relative link before passing the data on to storage or query systems (fiddly details are available to those that care). If the foaf:depiction property were simply omitted when no photo was present, this problem wouldn’t arise. We’d then have to make the query a little more flexible, so that it still matched people even if there was no depiction, but that’s easy.
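
In sketch form (untested here, and the only change from the query above is that the depiction pattern moves into an OPTIONAL block):

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name ?pic ?id
WHERE {
  ?person a :Person ;
    :name ?name ;
    :holdsAccount [ :accountServiceHomepage <http://www.facebook.com/> ;
                    :accountName ?id ] .
  OPTIONAL { ?person :depiction ?pic }
}
ORDER BY ?name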

I mentioned a couple of days ago that SPARQL is a query language with built-in support for asking questions about data provenance, ie. we can mix in “according to Facebook”, “according to Jabber” right into the WHERE clause of queries such as the one I show here. I’m not going to get into that in depth today, beyond the small taste below, but I will close with a visual observation about why that is important.
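
The mechanism is SPARQL’s GRAPH keyword, which scopes part of a query to a named data source. With hypothetical per-site graph names, the general shape is:

PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT ?name ?source
WHERE {
  # ?source binds to the named graph each person description came from,
  # eg. one graph per YASN export (graph URIs here would be store-specific)
  GRAPH ?source {
    [ a :Person ; :name ?name ]
  }
}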

yasn map, borrowed from data junk, valleywag blog
To state the obvious, there’ll always be multiple Web sites where people hang out and socialise. A friend sent me this link the other day; a world map of social networks (thumbnail version copied here). I can’t vouch for the science behind it, but it makes the point that we risk fragmenting Web communities on geographic boundaries if we don’t bridge the various IM and YASN networks. There are lots of ways this can be done, each with different implications for user experience, business model, cost and practicality. But it has to happen. And when it does, we’ll be wanting ways of asking questions against aggregations from across these sites…