Project ideas


OpenSocial’s API reference describes a number of classes (‘Person’, ‘Name’, ‘Email’, ‘Phone’, ‘Url’, ‘Organization’, ‘Address’, ‘Message’, ‘Activity’, ‘MediaItem’, …), each of which has various properties whose values are either strings, references to instances of other classes, or enumerations. I’d like to make them usable beyond the confines of OpenSocial, so I’m making an RDF/OWL version. OpenSocial’s schema is an attempt to provide an overarching model for much of present-day mainstream ‘social networking’ functionality, including dating, jobs etc. Such a broad effort is inevitably somewhat open-ended, and so may benefit from being linked to data from other complementary sources.

With a bit of help from the shindig-dev list, #opensocial IRC, and Kevin Brown and Kevin Marks, I’ve tracked down the source files used to represent OpenSocial’s data schemas: they’re in the opensocial-resources SVN repository on code.google.com. There is also a downstream copy in the Apache Shindig SVN repo (I’m not very clear on how versioning and evolution are managed between the two). They’re JavaScript files, structured so that documentation can be generated via javadoc. The Shindig-PHP schema diagram I posted recently is a representation of this schema.

So - my RDF version. At the moment it is merely a list of classes and their properties (expressed via rdfs:domain), written using RDFa/HTML. I don’t yet define rdfs:range for any of these, nor handle the enumerated values (opensocial.Enum.Smoker, opensocial.Enum.Drinker, opensocial.Enum.Gender, opensocial.Enum.LookingFor, opensocial.Enum.Presence) that are defined in enum.js.
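
As a rough illustration of what "a list of classes and their properties" amounts to, here is a minimal Python sketch that prints the same kind of statements as Turtle. The namespace URI and the tiny two-class subset are placeholders chosen for illustration, not the vocabulary's real identifiers, and the actual generator emits RDFa/HTML rather than Turtle.

```python
# Hypothetical namespace and a small illustrative subset of the schema.
NS = "http://example.org/opensocial#"

SCHEMA = {
    "Person": ["nickname", "aboutMe"],
    "Address": ["locality", "country"],
}

def to_turtle(schema, ns=NS):
    """Emit one rdfs:Class per class and one rdf:Property per field,
    tying each property to its class with rdfs:domain."""
    lines = [
        "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        f"@prefix os: <{ns}> .",
        "",
    ]
    for cls, props in sorted(schema.items()):
        lines.append(f"os:{cls} a rdfs:Class .")
        for prop in props:
            lines.append(f"os:{prop} a rdf:Property ; rdfs:domain os:{cls} .")
    return "\n".join(lines)

print(to_turtle(SCHEMA))
```

Adding rdfs:range later would be one more clause per property line, which is why starting with domain-only statements is a reasonable first step.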

The code is all in the FOAF SVN, and accessible via “svn co http://svn.foaf-project.org/foaftown/opensocial/vocab/”. I’ve also taken the liberty of including a copy of the OpenSocial *.js files, and Mozilla’s Rhino JavaScript interpreter js.jar in there too, for self-containedness.

The code in schemarama.js will simply generate an RDFa/XHTML page describing the schema. This can be checked using the W3C validator, or converted to RDF/XML with the pyRDFa service at W3C.

I’ve tested the output using the OwlSight/pellet service from Clark & Parsia, and with Protege 4. It’s basic but seems OK and a foundation to build from. Here’s a screenshot of the output loaded into Protege (which btw finds 10 classes and 99 properties).

An example view from Protege, showing the class browser in one panel, and a few properties of Person in another.

OK so why might this be interesting?

  • Using OpenSocial-derived vocabulary and OpenSocial-exported data in other contexts
    • databases (queryable via SPARQL)
    • mixed with FOAF
    • mixed with Microformats
    • published directly in RDFa/HTML
  • Mapping OpenSocial terms to other contact and social network schemas

This suggests some goals for continued exploration:

It should be possible to use “OpenSocial markup” in an ordinary homepage or blog (HTML or XHTML), drawing on any of the descriptive concepts they define, through using RDFa’s markup notation. As Mark Birbeck pointed out recently, RDFa is an empty vessel - it does not define any descriptive vocabulary. Instead, the RDF toolset offers an environment in which vocabulary from multiple independent sources can be mixed and merged quite freely. The hard work of the OpenSocial team in analysing social network schemas and finding commonalities, or of the Microformats scene in defining simple building-block vocabularies … these can hopefully be combined within a single environment.

I’m not at the BBC’s 2008 hackday-like-event, Mashed. But here’s a quick hack based on the data the BBC audio and music team have made available. The data that caught my eye was “Genres for set of MusicBrainz Artists” based on editorial data entered for bbc.co.uk/music. This is a simple file:

0039c7ae-e1a7-4a7d-9b49-0cbc716821a6    Rock and Indie
003abc43-e2bb-40e5-a080-3c4b9e56ea63    Classical
0053dbd9-bfbc-4e38-9f08-66a27d914c38    Classic Pop and Rock

It maps a MusicBrainz artist ID (increasingly the de facto open standard for identifying artists, at least in popular western music) to a simple genre label.

I haven’t yet found corresponding pages on the BBC music site for each of these genres.

Since last.fm expose my last 12 months’ most commonly played artists for all to mock, it is quite easy to cross-reference these sources to get a summary of my alleged musical interests.

A commandline ruby script online for now:

Airbag:mashed danbri$ ruby lastfm-genres.rb
Classic Pop and Rock: 13
Rock and Indie: 17
Hip Hop; RnB and Dance Hall: 1
World: 1
Dance and Electronica: 12

It’s a while since I wrote any code, clearly: this should at least be sorted and trimmed to the top 3 or so. We’d need to look at a few people’s profiles to figure out the best approach to summarising someone’s interests, and a little thought is needed for representing this in RDF/FOAF.
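
The cross-referencing step, with the sorting and trimming mentioned above, can be sketched in a few lines. This is a minimal Python sketch and not the Ruby script itself: the three genre rows are copied from the BBC file sample, while the `my_artists` list is a made-up stand-in for whatever a last.fm top-artists listing would yield.

```python
from collections import Counter

# Sample rows copied from the BBC file: MusicBrainz artist ID, then genre.
GENRE_ROWS = """\
0039c7ae-e1a7-4a7d-9b49-0cbc716821a6    Rock and Indie
003abc43-e2bb-40e5-a080-3c4b9e56ea63    Classical
0053dbd9-bfbc-4e38-9f08-66a27d914c38    Classic Pop and Rock
"""

def parse_genres(text):
    """Map each artist ID to its genre label (split on the first run of whitespace)."""
    genres = {}
    for line in text.splitlines():
        if line.strip():
            mbid, genre = line.split(None, 1)
            genres[mbid] = genre.strip()
    return genres

def top_genres(artist_ids, genres, n=3):
    """Count genres over the given artists, most common first, trimmed to n."""
    counts = Counter(genres[a] for a in artist_ids if a in genres)
    return counts.most_common(n)

# Hypothetical listening history (assumption, not real last.fm output).
my_artists = [
    "0039c7ae-e1a7-4a7d-9b49-0cbc716821a6",
    "0053dbd9-bfbc-4e38-9f08-66a27d914c38",
    "0039c7ae-e1a7-4a7d-9b49-0cbc716821a6",
    "ffffffff-0000-0000-0000-000000000000",  # no genre on file; skipped
]

for genre, count in top_genres(my_artists, parse_genres(GENRE_ROWS)):
    print(f"{genre}: {count}")
```

Counting one vote per appearance is the simplest choice; weighting by play count would be an obvious variation once real profiles are to hand.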

Now where I see OAuth fitting into this picture is the “what do we do next” step. OAuth potentially addresses a problem we’ve had in the FOAF scene, whereby FOAF generators and adaptors produce a chunk of markup, but there’s no easy/natural way to post this back into the Web. I’m hoping that blogs and hosting sites will allow external FOAF sources (like this script) to update/augment the FOAF descriptions we host in our existing Web sites and profiles. I sent some notes on this to the OAuth list (albeit to a deafening silence).

See also:  mashed last.fm / bbc genres ruby script

 Just found this interesting presentation,

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
by Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, D. Stott Parker; as presented by Nate Rober (PDF)

Excerpts:

Extending MapReduce
1. Change to reduce phase
2. Merge phase
3. Additional user-definable operations
a. partition selector
b. processor
c. merger
d. configurable iterators

Implementing Relational Algebra Operations
1. Projection
2. Aggregation
3. Selection
4. Set Operations: Union, Intersection, Difference
5. Cartesian Product
6. Rename
7. Join

[for more detail see full slides]

Conclusion
MapReduce & GFS represent a paradigm shift in data processing: use a simplified interface instead of overly general DBMS.
Map-Reduce-Merge adds the ability to execute arbitrary relational algebra queries.
Next steps: develop SQL-like interface and  a query optimizer.
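
To make the "merge phase" idea concrete, here is a toy, single-process Python sketch of a join expressed as two map/reduce passes followed by a merge on matching keys. It is only an illustration of the general shape: the function names and sample data are invented, and it leaves out the partition selectors, processors and configurable iterators that the slides list.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run a toy in-memory map and reduce, returning {key: reduced value}."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def merge(left, right, merger):
    """Merge step: combine two reduced outputs on matching keys (an equi-join)."""
    return [merger(k, left[k], right[k]) for k in sorted(left.keys() & right.keys())]

# Two unrelated datasets that share a key (made-up sample data).
plays = [("artist-a", 3), ("artist-b", 1), ("artist-a", 2)]
genres = [("artist-a", "Rock and Indie"), ("artist-c", "Classical")]

play_totals = map_reduce(plays, lambda r: [(r[0], r[1])], lambda k, vs: sum(vs))
genre_of = map_reduce(genres, lambda r: [(r[0], r[1])], lambda k, vs: vs[0])

joined = merge(play_totals, genre_of, lambda k, total, genre: (k, total, genre))
print(joined)
```

Plain MapReduce handles each dataset on its own; the extra merge step is what lets two differently shaped outputs be combined, which is the piece a relational join needs.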

Research paper: Map-reduce-merge: simplified relational data processing on large clusters (PDF for ACM people)

Linked from HRDF page in the Hadoop wiki, where there appears to be a proposal brewing to build an RDF store on top of the Hadoop/Hbase infrastructure.

Nearby: LargeTripleStores in ESW wiki

Not entirely unrelated: Google Social Graph API  (which parses FOAF/RDF from ‘The Web’ but discards all but the social graph parts currently)

Via Libby; Bruce Schneier on data:

In the information age, we all have a data shadow.

We leave data everywhere we go. It’s not just our bank accounts and stock portfolios, or our itemized bills, listing every credit card purchase and telephone call we make. It’s automatic road-toll collection systems, supermarket affinity cards, ATMs and so on.

It’s also our lives. Our love letters and friendly chat. Our personal e-mails and SMS messages. Our business plans, strategies and offhand conversations. Our political leanings and positions. And this is just the data we interact with. We all have shadow selves living in the data banks of hundreds of corporations’ information brokers — information about us that is both surprisingly personal and uncannily complete — except for the errors that you can neither see nor correct.

What happens to our data happens to ourselves.

This shadow self doesn’t just sit there: It’s constantly touched. It’s examined and judged. When we apply for a bank loan, it’s our data that determines whether or not we get it. When we try to board an airplane, it’s our data that determines how thoroughly we get searched — or whether we get to board at all. If the government wants to investigate us, they’re more likely to go through our data than they are to search our homes; for a lot of that data, they don’t even need a warrant.

Who controls our data controls our lives. [...]

Increasingly, we’re going to be seeing this data flow through protocols like OAuth. SemWeb people should get their heads around how this is likely to work. It’s rather likely we’ll see SPARQL data stores with non-public personal data flowing through them; what worries me is that there’s not yet any data management discipline on top of this that’ll help us keep track of who is allowed to see what, and which graphs should be deleted or refreshed at which times.

I recently transcribed some notes from a Robert Scoble post about Facebook and data portability into the FOAF wiki. In it, Scoble reported some comments from Dave Morin of Facebook, regarding data flow. Excerpts:

For instance, what if a user wants to delete his or her info off of Facebook. Today that’s possible. But what about in a really data portable world? After all, in such a world Facebook might have sprayed your email and other data to other social networks. What if those other social networks don’t want to delete your data after you asked Facebook to?

Another case: you want your closest Facebook friends to know your birthday, but not everyone else. How do you make your social network data portable, but make sure that your privacy is secured?

Another case? Which of your data is yours? Which belongs to your friends? And, which belongs to the social network itself? For instance, we can say that my photos that I put on Facebook are mine and that they should also be shared with, say, Flickr or SmugMug, right? How about the comments under those photos? The tags? The privacy data that was entered about them? The voting data? And other stuff that other users might have put onto those photos? Is all of that stuff supposed to be portable? (I’d argue no, cause how would a comment left by a Facebook user on Facebook be good on Flickr?) So, if you argue no, where is the line? And, even if we can all agree on where the line is, how do we get both Facebook and Flickr to build the APIs needed to make that happen?

I’d like to see SPARQL stores that can police their data access behaviour, with clarity for each data graph in the store about the contexts in which that data can be re-exposed, and the schedule by which the data should be refreshed or purged. Making it easy for data to flow is only half the problem…
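
One way to picture the bookkeeping this would need is a small record per named graph, saying where it came from, where it may be re-exposed, and when it goes stale. The Python sketch below is purely hypothetical: the field names, graph URIs and contexts are invented for illustration, and nothing here is part of SPARQL or of any existing store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class GraphPolicy:
    """Hypothetical per-graph bookkeeping; none of this is standard SPARQL."""
    source: str                  # where the graph was fetched from
    fetched: datetime            # when we last retrieved it
    max_age: timedelta           # purge or refresh after this long
    allowed_contexts: frozenset  # contexts in which it may be re-exposed

def may_expose(policy, context, now):
    """A graph may be shown only in an allowed context and while still fresh."""
    return context in policy.allowed_contexts and now - policy.fetched <= policy.max_age

def expired(policies, now):
    """Names of graphs that are due to be purged or refreshed."""
    return sorted(name for name, p in policies.items() if now - p.fetched > p.max_age)

now = datetime(2008, 6, 1)
policies = {
    "http://example.org/graphs/alice-birthday": GraphPolicy(
        "http://example.org/alice", datetime(2008, 5, 30),
        timedelta(days=7), frozenset({"close-friends"})),
    "http://example.org/graphs/alice-photos": GraphPolicy(
        "http://example.org/alice", datetime(2008, 4, 1),
        timedelta(days=30), frozenset({"public"})),
}

print(may_expose(policies["http://example.org/graphs/alice-birthday"], "public", now))
print(expired(policies, now))
```

The birthday case from the excerpt maps onto `allowed_contexts`, and the deletion case onto `expired`; the hard part, getting every downstream store to honour the same record, is exactly what the sketch cannot show.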


How it works: The Web
Originally uploaded by danbri

Or, “but what do all those links mean?”

Based on the 1994 slides by TimBL which inspired the SWAD-Europe graphics and shirt.

The twist here is just an emphasis that the giant global graph is a graph of idiosyncratic claims, and only sometimes do we all see the world the same way.

“ordinary life is pretty complex stuff” - Harvey Pekar

I just signed up to give a talk at the Microformats vEvent in London, May 27th; thanks to the organizers (Frances Berriman and Drew McLellan of microformats.org) for inviting me :)

I’ve called it “One Big Happy Family: Practical Collaboration on Meaningful Markup” and my goal really is to help make it easier for enthusiasts for both RDF and Microformats to say ‘we’ rather than ‘they’ a bit more often when discussing complementary efforts from this community. As I said on the foaf-dev list yesterday, “anything good for Microformats is good for FOAF”; vice-versa too, I hope. There’s only one Web and we’re all doing our bit, with the tools and techniques we know best.

Here’s the abstract:

This talk explores some ways in which the Microformat and RDF approaches can complement each other, and some ways in which we can share data, tools and experiences between these two technologies. It will outline the often-unarticulated history of the RDF design, the techniques used for parsing and querying RDF data, and the things made easy and hard through this approach. RDF techniques can be contrasted with the different choices made for Microformats. However these differences obscure an underlying similarity that comes from shared ‘Webby’ values.

Edit: it seems I’m incapable of spelling “compl[ie]mentary”. Freudian slip? :)

BTW the London Web Week site has just gone live; check it out…

Speaks, reads, writes
Stephanie Booth asks:

 I vaguely remember somebody telling me about some emerging “standard” (too big a word) for encoding language skills. Or was it a dream?

That would’ve been me, showing markup from the FOAFX beta from Paola Di Maio and friends, which explores the extension of FOAF with expertise information. This is part of the ExpertFinder discussions alongside the FOAF project (see also wiki, mailing list). FOAFX and the ExpertFinder community are looking at ways of extending FOAF to better describe people’s expertise; both self-described and externally accredited. This is at once a fascinating, important and terrifyingly hard to scope problem area. It touches on longstanding “Web of trust” themes, on educational metadata standards, and on the various ways of characterising topics or domains of expertise. In other words, in any such problem space, there will always be multiple ways of “doing it”. For example, here is how the Advogato community site characterises my expertise regarding opensource software: foaf.rdf (I’m in the Journeyer group, apparently; some weighted average of people’s judgements about me).

One thing FOAFX attempts is to describe language skills. For this, they extend the idiom proposed by Inkel some years ago in his “Speaks, Reads, Writes” schema. In the original (which is Spanish, but see also English version), the classification was effectively binary: one could either speak, read, or write a language; or one couldn’t. You could also say you ‘mastered’ it, meaning that you could speak, read and write it. In FOAFX, this is handled differently: we get a 1-5 score. I like this direction, as it allows me to express that I have some basic capability in Spanish, without appearing to boast that I’m anything like “fluent”. But … am I a “1” or a “2”? Should I poll my long-suffering Spanish-speaking friends? Take an online quiz? Introducing numbers gives the impression of mathematical precision, but in skill characterisation this is notoriously hard (and not without controversy).

My take here is that there’s no right thing to do. So progress and experimentation are to be celebrated, even if the solution isn’t perfect. On language skills, I’d love some way also to allow people to say “I’m learning language X”, or “I’m happy to help you practice your English/Spanish/Japanese/etc.”. Who knows, with more such information available, online Social Network sites could even prove useful…

Here btw is the current RDF markup generated by FOAFX:

<foaf:Person rdf:ID="me">
  <foaf:mbox_sha1>6e80d02de4cb3376605a34976e31188bb16180d0</foaf:mbox_sha1>
  <foaf:givenname>Dan</foaf:givenname>
  <foaf:family_name>Brickley</foaf:family_name>
  <foaf:homepage rdf:resource="http://danbri.org/" />
  <foaf:weblog rdf:resource="http://danbri.org/words/" />
  <foaf:depiction rdf:resource="http://danbri.org/images/me.jpg" />
  <foaf:jabberID>danbrickley@gmail.com</foaf:jabberID>
  <foafx:language>
    <foafx:Language>
      <foafx:name>English</foafx:name>
      <foafx:speaking>5</foafx:speaking>
      <foafx:reading>5</foafx:reading>
      <foafx:writing>5</foafx:writing>
    </foafx:Language>
  </foafx:language>
  <foafx:language>
    <foafx:Language>
      <foafx:name>Spanish</foafx:name>
      <foafx:speaking>1</foafx:speaking>
      <foafx:reading>1</foafx:reading>
      <foafx:writing>1</foafx:writing>
    </foafx:Language>
  </foafx:language>
  <foafx:expertise>
    <foafx:Expertise>
      <foafx:field>::</foafx:field>
      <foafx:fluency>
        <foafx:Language>
          <foafx:name>English</foafx:name>
        </foafx:Language>
      </foafx:fluency>
    </foafx:Expertise>
  </foafx:expertise>
</foaf:Person>

The apparent redundancy in the markup (expertise, Expertise) is due to RDF’s so-called “striped” syntax. I have an old introduction to this idea; in short, RDF lets you define properties of things, and categories of thing. The FOAFX design effectively says, “there is a property of a person called ‘expertise’ which relates that person to another thing, an ‘Expertise’, which itself has properties like ‘fluency’”.
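
The alternation of thing, property, thing can be seen by walking a cut-down copy of the markup and printing the statements it encodes. This is a simplified Python sketch, not a real RDF/XML parser: it ignores attributes such as rdf:ID and rdf:resource, invents blank-node labels, and uses a placeholder namespace URI for foafx because the real one is not shown here.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the markup above. The foafx namespace URI is a
# placeholder (assumption); the real one is not given in the post.
SNIPPET = """
<foaf:Person xmlns:foaf="http://xmlns.com/foaf/0.1/"
             xmlns:foafx="http://example.org/foafx#">
  <foafx:language>
    <foafx:Language>
      <foafx:name>Spanish</foafx:name>
      <foafx:speaking>1</foafx:speaking>
    </foafx:Language>
  </foafx:language>
</foaf:Person>
"""

def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.split("}", 1)[-1]

def stripes(node, triples, counter):
    """Walk node -> property -> node -> ... collecting (subject, property, value)."""
    counter[0] += 1
    subject = f"_:{local(node.tag)}{counter[0]}"
    # A capitalised element names the category (class) of the thing.
    triples.append((subject, "rdf:type", local(node.tag)))
    for prop in node:                    # child elements of a thing are properties
        children = list(prop)
        if children:                     # the value is another thing
            obj = stripes(children[0], triples, counter)
        else:                            # the value is a plain literal
            obj = (prop.text or "").strip()
        triples.append((subject, local(prop.tag), obj))
    return subject

triples = []
stripes(ET.fromstring(SNIPPET), triples, [0])
for t in triples:
    print(t)
```

Each nesting level flips between "a thing" and "a property of that thing", which is why both `language` and `Language` appear.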

The FOAFX design tries to navigate between generic and specific, by including language-oriented markup as well as more generic skill descriptions. I think this is probably the right way to go. There are many things that we can say about human languages that don’t apply to other areas of expertise (eg. opensource software development). And there are many things we can say about expertise in general (like expressions of willingness to learn, to teach, … indications of formal qualification) which are cross domain. Similarly, there are many things we might say in markup about opensource projects (picking up on my Advogato mention earlier) which have nothing to do with human languages. Yet both human language expertise and opensource skills are things we might want to express via FOAF extensions. For example, the DOAP project already lets us describe opensource projects and our roles in them.

The Semantic Web design challenge here is to provide a melting pot for all these different kinds of data, one that allows each specific problem to be solved adequately in a reasonable time-frame, without precluding the possibility for richer integration at a later date. I have a hunch that the Advogato design, which expresses skills in terms of group membership, could be a way to go here.

This is related to the idea of expressing group-membership criteria through writing SPARQL queries. For example, we can talk about the Group of people who work for W3C. Or we can talk about the Group of people who work for W3C as listed authoritatively on the W3C site. Both rules are expressible as queries; the latter a query that says things about the source of claims, as well as about what those claims assert. This notion of a group defined by a query allows for both flavours; the definition could include criteria relating to the provenance (ie. source) of the claims, but it needn’t. So we could express the idea of people who speak Spanish, or the idea of people who speak French according to having passed some particular test, or being certified by some agency. In either case, the unifying notion is “person X is in group Y”, where Y is a group identified by some URL. What I like about this model is that it allows for a very loose division of labour: skill-related markup is necessarily going to be widely varied. Yet the idea that such scattered evidence boils down to people falling into definable groups gives some overall cohesion to this diversity. I could for example run a query asking for people with (foafx idiom) “Spanish skills of 2 or more”. I could add a constraint that the person be at least a “Journeyer” regarding their opensource skills, according to Advogato, or perhaps mix in data expressed in DOAP terms regarding their roles in opensource project work. These skills effectively define groups (loosely, sets) of people, and skill search can be pictured in Venn diagram terms. Of course all this depends on getting enough data out there for any such queries to be worthwhile. Maybe a Facebook app that re-published data outside of Hotel Facebook would be a way of bootstrapping things here?
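
The "group defined by a rule" idea can be sketched with plain Python sets standing in for SPARQL queries. Everything in the sample data is invented for illustration, the dictionary keys are hypothetical rather than FOAFX or Advogato terms, and provenance of the claims is not modelled at all.

```python
# Hypothetical, hand-made data; real input would come from FOAFX, Advogato
# or DOAP documents gathered from the Web.
people = {
    "alice": {"spanish": 3, "advogato": "Journeyer"},
    "bob":   {"spanish": 1, "advogato": "Master"},
    "carol": {"spanish": 4, "advogato": "Apprentice"},
}

# Advogato's certification levels, lowest to highest.
LEVELS = ["Observer", "Apprentice", "Journeyer", "Master"]

def group(predicate):
    """A group is simply the set of people for whom the rule holds."""
    return {name for name, facts in people.items() if predicate(facts)}

spanish_2_plus = group(lambda f: f.get("spanish", 0) >= 2)
journeyer_plus = group(
    lambda f: LEVELS.index(f.get("advogato", "Observer")) >= LEVELS.index("Journeyer"))

# Skill search as Venn-diagram overlap: set intersection.
print(sorted(spanish_2_plus & journeyer_plus))
```

Because each rule yields an ordinary set, criteria from quite different vocabularies combine with the same intersection and union operations, which is the loose division of labour described above.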

Venn groups diagram

This is the result of feeding a relatively small list of groups and their members to the VennMaster Java tool. I’ve been looking for swooshy automatic layout tools that might help with interactive visualisation of ‘social graph’ data where people can be clustered by their membership of various groups. I also wanted to explore the possibility of using such a tool as a way of authoring filters from raw evidence, using groups such as “have sent mail to”, “have accepted blog/wiki comment from”.

My gut reaction from this quick experiment is that the UI space is very easily overwhelmed. I used here just a quick hand-coded list of people, in fairly ad-hoc groups (cities, current-and-former workplaces etc.). Real data has more people and more groups. Still, I think there may be something worth investigating here. The Venn tool I found was really designed for lifesci data, not people. If anyone knows of other possible software to try here, do let me know. To try this tool, simply run the Java app from the commandline, and use “File >> OpenList” on a file such as people.list.

One other thing I noticed in creating these ad-hoc groups (more or less ‘people tags’), is that representing what people have done felt intuitively as important as what they’re doing right now. For example, places people once lived or worked. This gives another axis of complexity that might need visualising. I’d like the underlying data to know who currently works/lives somewhere, versus “used to”, but in some views the two might appropriately be folded together. Tricky.