YouAndYouAndYouTube: Viacom, Privacy and the Social Graph API

From Wired via Thomas Roessler:

Google will have to turn over every record of every video watched by YouTube users, including users’ names and IP addresses, to Viacom, which is suing Google for allowing clips of its copyright videos to appear on YouTube, a judge ruled Wednesday.

I hope nobody thought their behaviour on youtube.com was a private matter between them and Google.

The Judge’s ruling (pdf) is interesting to read (ok, to skim). As the Wired article says,

The judge also turned Google’s own defense of its data retention policies — that IP addresses of computers aren’t personally revealing in and of themselves — against it to justify the log dump.

Here’s an excerpt. Note that there is also a claim that YouTube account IDs aren’t personally identifying.

Defendants argue that the data should not be disclosed because of the users’ privacy concerns, saying that “Plaintiffs would likely be able to determine the viewing and video uploading habits of YouTube’s users based on the user’s login ID and the user’s IP address”.

But defendants cite no authority barring them from disclosing such information in civil discovery proceedings, and their privacy concerns are speculative.  Defendants do not refute that the “login ID is an anonymous pseudonym that users create for themselves when they sign up with YouTube” which without more “cannot identify specific individuals”, and Google has elsewhere stated:

“We . . . are strong supporters of the idea that data protection laws should apply to any data that could identify you. The reality is though that in most cases, an IP address without additional information cannot.” — Google Software Engineer Alma Whitten, Are IP addresses personal?, GOOGLE PUBLIC POLICY BLOG (Feb. 22, 2008)

So forget the IP address part for now.

Since early this year, Google have been operating an experimental service called the Social Graph API. From their own introduction to the technology:

With so many websites to join, users must decide where to invest significant time in adding their same connections over and over. For developers, this means it is difficult to build successful web applications that hinge upon a critical mass of users for content and interaction. With the Social Graph API, developers can now utilize public connections their users have already created in other web services. It makes information about public connections between people easily available and useful.

Only public data. The API returns web addresses of public pages and publicly declared connections between them. The API cannot access non-public information, such as private profile pages or websites accessible to a limited group of friends.

Google’s Social Graph API makes easier something that was already possible: using XFN and FOAF markup from the public Web to associate more personal information with YouTube accounts. It takes information that was already public and makes it far more amenable to automated processing. If I choose to link to my YouTube profile with the XFN markup rel=’me’ from another of my profiles (e.g. an anchor like <a rel=”me” href=”http://youtube.com/user/modanbri”> on my FriendFeed page), those 8 characters are sufficient to bridge my allegedly anonymous YouTube ID to arbitrary other personal information. And the bridge is machine-readable, in a form for which Google has already demonstrated a planet-wide index.

Here is the data returned by Google’s Social Graph API when asking for everything about my YouTube URL:

{
 "canonical_mapping": {
  "http://youtube.com/user/modanbri": "http://youtube.com/user/modanbri"
 },
 "nodes": {
  "http://youtube.com/user/modanbri": {
   "attributes": {
    "url": "http://youtube.com/user/modanbri",
    "profile": "http://youtube.com/user/modanbri",
    "rss": "http://youtube.com/rss/user/modanbri/videos.rss"
   },
   "claimed_nodes": [
   ],
   "unverified_claiming_nodes": [
    "http://friendfeed.com/danbri",
    "http://www.mybloglog.com/buzz/members/danbri"
   ],
   "nodes_referenced": {
   },
   "nodes_referenced_by": {
    "http://friendfeed.com/danbri": {
     "types": [
      "me"
     ]
    },
    "http://guttertec.swurl.com/friends": {
     "types": [
      "friend"
     ]
    },
    "http://www.mybloglog.com/buzz/members/danbri": {
     "types": [
      "me"
     ]
    }
   }
  }
 }
}

You can see here that the SGAPI, built on top of Google’s Web crawl of public pages, has picked out the connection to my FriendFeed (see FOAF file) and MyBlogLog (see FOAF file) accounts, both of which export XFN and FOAF descriptions of my relationship to this YouTube account, linking it up with various other sites and profiles I’m publicly associated with.
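Reproducing this lookup is a single HTTP GET. Here’s a minimal Python sketch; the endpoint and parameter names (q, edo, edi, fme) are as I understand them from the SGAPI documentation, so treat the details as illustrative rather than definitive:

# Sketch: ask the Social Graph API about a URL. Endpoint and
# parameter names are my reading of the SGAPI docs, not gospel.
import urllib
import simplejson  # JSON parser; ships separately for Python 2.5

SGAPI = 'http://socialgraph.apis.google.com/lookup'

def sgapi_lookup(url):
    query = urllib.urlencode({
        'q': url,       # node(s) to look up
        'edo': '1',     # include outgoing edges (nodes_referenced)
        'edi': '1',     # include incoming edges (nodes_referenced_by)
        'fme': '1',     # follow "me" links between equivalent nodes
        'pretty': '1',
    })
    return simplejson.load(urllib.urlopen(SGAPI + '?' + query))

data = sgapi_lookup('http://youtube.com/user/modanbri')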

YouTube users who have linked their YouTube account URLs from other social Web sites (something sites like FriendFeed and MyBlogLog actively encourage) are no longer anonymous on YouTube. This is their choice. It can give them a mechanism for sharing ‘favourited’ videos with a wide circle of friends, without those friends needing logins on YouTube or other Google services. This clearly has business value for YouTube and similar ‘social video’ services, as well as for users and Social Web aggregators.

Given such a trend towards increased cross-site profile linkage, it is unfortunate to read that YouTube identifiers are being presented as essentially anonymous IDs: this is clearly not the case. If you know my YouTube ID ‘modanbri’ you can quite easily find out a lot more about me, certainly enough to establish, with strong probability, my real-world identity. As I say, this is my conscious choice as a YouTube user; had I wanted to be (more) anonymous, I would have behaved differently. To understand YouTube IDs as anonymous accounts is to radically misunderstand the nature of the modern Web.
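To make the bridging concrete: given a response like the one above, a few lines suffice to list every public page claiming (via rel=’me’ or its FOAF equivalent) to belong to the same person as a YouTube account. The field names below are taken directly from the JSON shown earlier:

# Sketch: collect pages that publicly claim to be the same person
# as a given node, using the response structure shown above.
def me_claims(sgapi_response, node_url):
    node = sgapi_response['nodes'][node_url]
    claimants = []
    for url, edge in node.get('nodes_referenced_by', {}).items():
        if 'me' in edge.get('types', []):
            claimants.append(url)
    return claimants

# Using the lookup response from the sketch above:
# me_claims(data, 'http://youtube.com/user/modanbri')
# -> ['http://friendfeed.com/danbri',
#     'http://www.mybloglog.com/buzz/members/danbri']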

Although it wouldn’t protect against all analysis, I hope the user IDs are at least scrambled before being handed over to Viacom. This would make it harder for them to be used to look up other data via (amongst other things) Google’s own YouTube and Social Graph APIs.
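By ‘scrambled’ I mean something like a keyed one-way hash: the same user maps to the same opaque token throughout the logs, so Viacom could still count views per account, but nobody without the key could turn a token back into a login ID usable against the APIs above. A sketch of the idea (the key handling here is purely illustrative, not a proposed scheme):

# Sketch: pseudonymise login IDs with a keyed one-way hash before
# disclosure. Consistent tokens preserve per-user view counts, but
# without the key the tokens can't be fed back into SGAPI lookups.
import hmac
import hashlib

SECRET_KEY = 'held-by-google-only'  # illustrative placeholder

def pseudonymize(login_id):
    return hmac.new(SECRET_KEY, login_id, hashlib.sha1).hexdigest()

print pseudonymize('modanbri')  # same input, same opaque token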

Update: I should note also that the bridging of YouTube IDs with other profiles is one that is not solely under the control of the YouTube user. Friends, contacts, followers and fans on other sites can link to YouTube profiles freely; this can be enough to compromise an otherwise anonymous account. Increasingly, these links are machine-processable; a trend I’ve previously argued is (for better or worse) inevitable.

Furthermore, the hypertext and data environment around YouTube and the Social Web is rapidly evolving; the lookups and associations we’ll be able to make in 1-2 years will outstrip what is possible today. It only takes a single hyperlink to reveal the owner of a YouTube account name; many such links will be created in the months to come.

Inevitable Nipple Analogy

A genetic theory of homosexuality.

The article reports on recent work (pdf) addressing the ‘if homosexuality is genetic, why hasn’t it died out?’ debate, which suggests that the ‘gene for male homosexuality persists because it promotes—and is passed down through—high rates of procreation among gay men’s mothers, sisters, and aunts’.

In other words, gayness in men can be as natural as the male nipple, even if both are initially puzzling when thought of in evolutionary terms. OK, I’m stretching things slightly, but I can’t help wondering whether the nipple analogy might be a good basis for informal arguments for a bit more tolerance:

Let He Who is Without Nipples Cast the First Stone?

Or maybe not.

More on the Great Nipple Question from straightdope.com; a moment of science; and the evolution-101 blog.

Beautiful plumage: Topic Maps Not Dead Yet

Echoing recent discussion of Semantic Web “Killer Apps”, an “are Topic Maps dead?” thread on the topicmaps mailing list. Signs of life offered include www.fuzzzy.com (‘Collaborative, semantic and democratic social bookmarking’; Topic Maps meet social networking; featured tag: ‘topic maps’) and a longer list from Are Gulbrandsen, who suggests that a predictable hype-cycle dropoff is occurring, as well as a migration of discussions from email into the blog world. For the latter, see the topicmaps planet aggregator, through which I indirectly found Steve Pepper’s blog and an interesting post on how TMs relate to RDF, OWL and the Semantic Web (though I’d have hoped for some mention of SKOS too).

Are Gulbrandsen also cites NZETC (the New Zealand Electronic Text Centre), winner of the Topic Maps Application of the Year award at the Topic Maps 2008 conference; see Conal Tuohy’s presentation on Topic Maps for Cultural Heritage Collections (slides in PDF). On NZETC’s work: “It may not look that interesting to many people used to flashy web 2.0 sites, but to anybody who have been looking at library systems it’s a paradigm shift”.

Other Topic Map work highlighted: RAMline (Royal Academy of Music rewriting musical history), “a long-term research project into the mapping of three axes of musical time: the historical, the functional, and musical time itself”; David Weinberger blogged about this work recently. Also MIPS / Institute for Bioinformatics and Systems Biology, who “attempt to explain the complexity of life with Topic Maps” (see the presentation from Volker Stümpflen (PDF); also a TMRA’07 talk).

Finally, pointers to open source developer tools: Ruby Topic Maps, and Wandora (Java/GPL), an extraction/mapping and publishing system which amongst other things can import RDF.

Topic Maps are clearly not dead, and the Web’s a richer environment because of this. They may not have set the world on fire, but people are finding value in the specs and tools, while also exploring interop with RDF and other related technologies. My hunch is that we’ll continue to see a slow drift towards RDF/OWL plus SKOS for apps that might otherwise have been addressed using Topic Maps, and a continued pragmatism from tool and app developers who see all these things as ways to solve problems, rather than as ends in themselves.

Just as with RDFa, GRDDL and Microformats, it is good and healthy for the Web community to be exploring multiple similar strands of activity. We’re smart enough to be able to flow data across these divides when needed, and having only a single technology stack would, I think, be intellectually limiting, socially impractical, and technologically short-sighted.

Map-reduce-merge and Hadoop/HBase RDF

Just found this interesting presentation:

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
by Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao and D. Stott Parker; as presented by Nate Rober (PDF)

Excerpts:

Extending MapReduce
1. Change to reduce phase
2. Merge phase
3. Additional user-definable operations
a. partition selector
b. processor
c. merger
d. configurable iterators

Implementing Relational Algebra Operations
1. Projection
2. Aggregation
3. Selection
4. Set Operations: Union, Intersection, Difference
5. Cartesian Product
6. Rename
7. Join

[for more detail see full slides]

Conclusion
MapReduce & GFS represent a paradigm shift in data processing: use a simplified interface instead of overly general DBMS.
Map-Reduce-Merge adds the ability to execute arbitrary relational algebra queries.
Next steps: develop SQL-like interface and a query optimizer.

Research paper: Map-reduce-merge: simplified relational data processing on large clusters (PDF for ACM people)
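The merge phase is the interesting addition: two independent map-reduce pipelines each emit sorted, partitioned output, and a user-defined merger walks matching partitions together, which is enough to express a relational join. A toy single-machine sketch of that idea (my illustration, not the paper’s actual API):

# Toy sketch of Map-Reduce-Merge's merge idea: join the sorted
# reduce outputs of two pipelines on a shared key (unique keys
# assumed for brevity). The real system partitions both outputs
# so matching key ranges land on the same merger.
def merge_join(left, right):
    """left, right: lists of (key, value) pairs, sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # a user-defined merger decides how values combine
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
    return out

employees = [(1, 'alice'), (2, 'bob')]           # pipeline A output
salaries = [(1, 40000), (2, 50000), (3, 60000)]  # pipeline B output
print merge_join(employees, salaries)
# [(1, 'alice', 40000), (2, 'bob', 50000)]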

Linked from the HRDF page in the Hadoop wiki, where there appears to be a proposal brewing to build an RDF store on top of the Hadoop/HBase infrastructure.

Nearby: LargeTripleStores in ESW wiki

Not entirely unrelated: Google Social Graph API (which parses FOAF/RDF from ‘the Web’ but currently discards all but the social graph parts)

“Stuff I’ve been thinking about” (SocialNetworkPortability WebCamp) – my slides

I’m in Cork, mainly for the excellent Social Network Portability event on Sunday, but am also staying through Blogtalk’08, which has been great. I’ve uploaded my slides from my talk (slideshare in Flash, included inline here, or a pdf). I have some rough speaking notes too; maybe I’ll get those online. I have no idea how they relate to whatever actually came out of my mouth during the talk :) Apologies to those without PDF or Flash. I haven’t tried Keynote’s HTML output yet.

Basically, much of what I was getting at in the talk (my thoughts are only just congealing on this) is that the idea of a ‘claim’ is a useful bridge between Semantic Web and Social Networking concerns, and that it helps us understand how these technologies fit together. FOAF defines a dictionary of terms for making claims, as do XFN and hCard. RDF/XML, Microformats, RDFa and GRDDL define textual notations for publishing documents that encode claims, and SPARQL gives us a way of asking questions about the claims made in different documents.
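To make the ‘claims’ framing concrete, here’s roughly what asking a question about claims looks like: parse a FOAF document and run a SPARQL query over whatever claims it encodes. A sketch using Python’s rdflib (2.4-style import; the FOAF URL is just my own file, and any FOAF document would do):

# Sketch: documents encode claims; SPARQL asks questions of them.
# Parses a FOAF file and lists the foaf:knows claims it makes.
from rdflib.Graph import Graph  # rdflib 2.4-style import

g = Graph()
g.parse('http://danbri.org/foaf.rdf')  # any FOAF document will do

KNOWS = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?friend
WHERE {
  ?person foaf:name ?name ;
          foaf:knows ?friend .
}
"""
for name, friend in g.query(KNOWS):
    print name, 'claims to know', friend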

OpenID plugin for WordPress

I’ve just installed Alan J Castonguay’s WordPress OpenID plugin on my blog, part of a cleanup that included nuking 11000+ comments in the moderation queue using the Spam Karma 2 plugin. Apologies if I zapped any real comments too. There are a few left, at least!

The OpenID thing appears to “just work”. By which I mean, I could log in via it and leave a comment. I’d be super-grateful if those of you with OpenIDs could take a minute to leave a comment on this post, to see if it works as well as it seems to. If it doesn’t, a bug report (to danbrickley@gmail.com) would be much appreciated. Those of you with LiveJournals or AOL/AIM accounts already have OpenID, even if you didn’t notice. See the HTML source for my homepage to see how I use “danbri.org” as an OpenID while delegating the hard work to LiveJournal. For more on OpenID, check out these tutorial slides (flash/pdf) from Simon Willison and David Recordon.

Thinking about OpenID-mediated blog comments, the tempting thing is to do something with the accumulated URIs. The plugin keeps its data in nice SQL tables, presumably accessible to other WordPress plugins. It’s been a while since I made a WordPress plugin, but plugin authors seem to have a pretty good framework available to them now.

mysql> select user_id, url from wp_openid_identities;
+---------+--------------------+
| user_id | url                |
+---------+--------------------+
|      46 | http://danbri.org/ |
+---------+--------------------+
1 row in set (0.28 sec)

At the moment, it’s just me. It’d be fun to try scooping up RDF (FOAF, SKOS, SIOC, feeds…) from any OpenID URIs that accumulate there. Hmm I even wrote up that project idea a while back – SparqlPress. At the time I tried prototyping it in Redland + PHP, but nowadays I’d probably use Benjamin Nowack’s ARC library, which provides SPARQL query of a MySQL-backed RDF store, and is written in PHP. This gives it the same dependencies as WordPress, making it ideal for pluginization. If anyone’s looking for a modest-sized practical SemWeb project to hack on, that one could be a lot of fun.

There’s a lot of interesting and creative fuss about “social networking” site interop around lately, largely thanks to the social graph paper from Brad Fitzpatrick and David Recordon. I lean towards the “show me, don’t tell me” approach regarding buddylists and suchlike (as does Julian Bond with Ecademy), which is why FOAF has only ever had the mild-mannered “knows” relationship in the core vocabulary, rather than trying to over-formalise “bestest friend EVER” and other teenisms. So what I like about this WordPress plugin is that it gives some evidence-based raw material for decentralised social networking apps. Blog comments don’t tell the whole story; nothing tells the whole story. But rather than maintain a FOAF “knows” list (or blogroll, or blog-reader config) by hand, I’d prefer to partially automate it by querying information about whose blogs I’ve commented on, and vice-versa. There’s a lot that could be built, intimidatingly much, and it’s hard to know where to start. I suggest that everyone in the SemWeb scene having an OpenID with a FOAF file linked from it would be an interesting platform from which to start exploring…

Meanwhile, I’ll try generating an RDF blogroll from any URIs that show up in my OpenID WordPress table, so I can build a planetplanet or chumpologica configuration automatically…
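A first cut could be as simple as: read the URIs out of the plugin’s table, fetch each page, and follow its FOAF autodiscovery link (the conventional <link rel=”meta” type=”application/rdf+xml”> in the page head). A rough sketch, with illustrative credentials and deliberately crude HTML scraping:

# Sketch: turn commenters' OpenID URIs into a list of FOAF files
# by following each page's FOAF autodiscovery <link>.
import re
import urllib
import urlparse
import MySQLdb  # the same database WordPress already uses

db = MySQLdb.connect(db='wordpress', user='wp', passwd='secret')
cursor = db.cursor()
cursor.execute('SELECT url FROM wp_openid_identities')

# crude autodiscovery; a real version would use an HTML parser
FOAF_LINK = re.compile(
    r'<link[^>]+type="application/rdf\+xml"[^>]+href="([^"]+)"', re.I)

for (url,) in cursor.fetchall():
    html = urllib.urlopen(url).read()
    match = FOAF_LINK.search(html)
    if match:
        foaf = urlparse.urljoin(url, match.group(1))
        print url, '->', foaf  # candidates for the planet config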