Video Linking: Archives and Encyclopedias

This is a quick visual teaser for some archive.org-related work I’m doing with NoTube colleagues, and a collaboration with Kingsley Idehen on navigating it.

In NoTube we are trying to match people and TV content by using rich linked data representations of both. I love Archive.org and with their help have crawled an experimental subset of the video-related metadata for the Archive. I’ve also used a couple of other sources; Sean P. Aune’s list of 40 great movies, and the Wikipedia page listing US public domain films. I fixed, merged and scraped until I had a reasonable sample dataset for testing.

I wanted to test the Microsoft Pivot Viewer (a Silverlight control), and since OpenLink’s Virtuoso package now has built-in support, I got talking with Kingsley and we ended up with the following demo. Since not everyone has Silverlight, and this is just a rough prototype that may be offline, I’ve made a few screenshots. The real thing is very visual, with animated zooms and transitions, but screenshots give the basic idea.

Notes: the core dataset for now is just links between archive.org entries and Wikipedia/dbpedia pages. In NoTube we’ll also try Lupedia, Zemanta, Reuter’s OpenCalais services on the Archive.org descriptions to see if they suggest other useful links and categories, as well as any other enrichment sources (delicious tags, machine learning) we can find. There is also more metadata from the Archive that we should also be using.

This simple preview simply shows how one extra fact per Archived item creates new opportunities for navigation, discovery and understanding. Note that the UI is in no way tuned to be TV, video or archive specific; rather it just lets you explore a group of items by their ‘facets’ or common properties. It also reveals that wiki data is rather chaotic, however some fields (release date, runtime, director, star etc.) are reliably present. And of course, since the data is from Wikipedia, users can always fix the data.

You often hear Linked Data enthusiasts talk about data “silos”, and the need to interconnect them. All that means here, is that when collections are linked, then improvements to information on one side of the link bring improvements automatically to the other. When a Wikipedia page about a director, actor or movie is improved, it now also improves our means of navigating Archive.org’s wonderful collection. And when someone contributes new video or new HTML5-powered players to the Archive, they’re also enriching the Encyclopedia too.

Archive.org films on a timeline by release date according to Wikipedia.

One thing to mention is that everything here comes from the Wikipedia data that is automatically extracted from by DBpedia, and that currently the extractors are not working perfectly on all films. So it should get better in the future. I also added a lot of the image links myself, semi-automatically. For now, this navigation is much more factually-based than topic; however we do have Wikipedia categories for each film, director, studio etc., and these have been mapped to other category systems (formal and informal), so there’s a lot of other directions to explore.

What else can we do? How about flip the tiled barchart to organize by the film’s distributor, and constrain the ‘release date‘ facet to the 1940s:

That’s nice. But remember that with Linked Data, you’re always dealing with a subset of data. It’s hard to know (and it’s hard for the interface designers to show us) when you have all the relevant data in hand. In this case, we can see what this is telling us about the videos currently available within the demo. But does it tell us anything interesting about all the films in the Archive? All the films in the world? Maybe a little, but interpretation is difficult.

Next: zoom in to a specific item. The legendary Plan 9 from Outer Space (wikipedia / dbpedia).

Note the HTML-based info panel on the right hand side. In this case it’s automatically generated by Virtuoso from properties of the item. A TV-oriented version would be less generic.

Finally, we can explore the collection by constraining the timeline to show us items organized according to release date, for some facet. Here we show it picking out the career of one Edward J. Kay, at least as far as he shows up as composer of items in this collection:

Now turning back to Wikipedia to learn about ‘Edward J. Kay’, I find he has no entry (beyond these passing mentions of his name) in the English Wikipedia, despite his work on The Ape Man, The Fatal Hour, and other films.  While the German Wikipedia does honour him with an entry, I wonder whether this kind of Linked Data navigation will change the dynamics of the ‘deletionism‘ debates at Wikipedia.  Firstly by showing that structured data managed elsewhere can enrich the Wikipedia (and vice-versa), removing some pressure for a single Wiki to cover everything. Secondly it provides a tool to stand further back from the data and view things in a larger context; a context where for example Edward J. Kay’s achievements become clearer.

Much like Freebase Parallax, the Pivot viewer hints at a future in which we explore data by navigating from sets of things to other sets of things - like the set of film’s Edward J. Kay contributed to.  Pivot doesn’t yet cover this, but it does very vividly present the potential for this kind of navigation, showing that navigation of films, TV shows and actors may be richer when it embraces more general mechanisms.

A Penny for your thoughts: New Year wishes from mechanical turkers

I wanted to learn more about Amazon’s Mechanical Turk service (wikipedia), and perhaps also figure out how I feel about it.

Named after a historical faked chess-playing machine, it uses the Web to allow people around the world to work on short low-pay ‘micro-tasks’. It’s a disturbing capitalist fantasy come true, echoing Frederick Taylor’s ‘Scientific Management‘ of the 1880s. Workers can be assigned tasks at the touch of the button (or through software automation); and rewarded or punished at the touch of other buttons.

Mechanical Turk has become popular for outsourcing large scale data cleanup tasks, image annotation, and other topics where human judgement outperforms brainless software. It’s also popular with spammers. For more background see ‘try a week as a turker‘ or this Salon article from 2006. Turk is not alone, other sites either build on it, or offer similar facilities. See for example crowdflower, txteagle, or Panos Ipeirotis’ list of micro-crowdsourcing services.

Crowdflower describe themselves as offering “multiple labor channels…  [using] crowdsourcing to harness a round-the-clock workforce that spans more than 70 countries, multiple languages, and can access up to half-a-million workers to dispatch diverse tasks and provide near-real time answers.”

Txteagle focuses on the explosion of mobile access in the developing world, claiming that “txteagle’s GroundSwell mobile engagement platform provides clients with the ability to communicate and incentivize over 2.1 billion people“.

Something is clearly happening here. As someone who works with data and the Web, it’s hard to ignore the potential. As someone who doesn’t like treating others as interchangeable, replaceable and disposable software components, it’s hard to feel very comfortable. Classic liberal guilt territory. So I made an account, both as a worker and as a ‘requester’ (an awkward term, but it’s clear why ‘employer’ is not being used).

I tried a few tasks. I wrote 25-30 words for a blog on some medieval prophecies. I wrote 50 words as fast as I could on “things I would change in my apartment”. I tagged some images with keywords. I failed to pass a ‘qualification’ test sorting scanned photos into scratched, blurred and OK. I ‘like’d some hopeless Web site on Facebook for 2 cents. In all I made 18 US cents. As a way of passing the time, I can see the appeal. This can compete with daytime TV or Farmville or playing Solitaire or Sudoko. I quite enjoyed the mini creative-writing tasks. As a source of income, it’s quite another story, and the awful word ‘incentivize‘ doesn’t do justice to the human reality.

Then I tried the other role: requester. After a little more liberal-guilt navelgazing (“would it be inappropriate to offer to buy people’s immortal souls? etc.”), I decided to offer a penny (well, 2 cents) for up to 100 people’s new year wish thoughts, or whatever of those they felt like sharing for the price.

I copy the results below, stripped of what little detail (eg. time in seconds taken) each result came with. I don’t present this as any deep insight or sociological analysis or arty meditation. It’s just what 100 people somewhere else in the Web responded with, when asked what they wish for 2011. If you want arty, check out the sheep market. If you want more from ‘turkers’ in their own voice, do visit the ‘Turker Nation’ forum. Also Turkopticon is essential reading,  “watching out for the crowd in crowdsourcing because nobody else seems to be.”

The exact text used was “Make a wish for 2011. Anything you like, just describe it briefly. Answers will be made public.”, and the question was asked with a simple Web form, “Make a wish for 2011, … any thought you care to share”.


Here’s what they said:

When you’re lonely, I wish you Love! When you’re down, I wish you Joy! When you’re troubled, I wish you Peace! When things seem empty, I wish you Hope! Have a Happy New Year!

wish u a happy new year…………

happy new year 2011. may this year bring joy and peace in your life

My wish for 2011 is i want to mary my Girlfriend this year.

I wish I will get pregnant in 2011!

i wish juhi becomes close to me

wish you a wonderful happy new year

wish you happy new year

for new year 2011 I wish Love of God must fill each human heart
Food inflation must be wiped off quickly
corruption must be rooted out smartly
Terrorism must be curtailed quickly
All People must get love, care, clothes, shelter & food
Love of God must fill each human heart…

Happy life.All desires to be fulfilled.

wish to be best entrepreneur of the year 2011

dont work hard if it is possible to do the same smarter way..
Be happy!

New year is the time to unfold new horizons,realise new dreams,rejoice in simple pleasures and gear up for new challenges.wishing a fulfilling 2011.

Remember that the best relationship is one where your love for each other is greater than your need for each other. Happy New Year

To get a newer car, and have less car problems. and have more income

I wish that my son’s health problems will be answered

Be it Success & Prosperity, Be it Fun and Frolic…

A new year is waiting for you. Go and enjoy the New Year on New Thought,”Rebirth of My Life”.

Let us wish for a world as one family, then we can overcome all the problems man made and otherwise.

My wish is to gain/learn more knowledge than in 2010

My new years wish for 2011 is to be happier and healthier.

I wish that I would be cured of heartache.

I am really very happy to wish you all very happy new year…..I wish you all the things to be success in your life and career…….. Just try to quit any bad habit within you. Just forgot all the bad incidents happen within your friends and try to enjoy this new year with pleasant……

Wish you a happy and prosperous new year.

I wish for a job.

I would hope that people will end the wars in the world.

Discontinue smoking and restrict intake of alcohol

I wish that my retail store would get a bigger client base so I can expand.

I Wish a wish for You Dear.Sending you Big bunch of Wishes from the Heart close to where.Wish you a Very Very Happy New Year

I wish for 2011 to be filled with more love and happiness than 2010.

Everything has the solution Even IMPOSSIBLE Makes I aM POSSIBLE. Happy Journey for New Year.

May each day of the coming year be vibrant and new bringing along many reasons for celebrations & rejoices. Happy New year

I have just moved and want to make some great new friends! Would love to meet a special senior (man!!) to share some wonderful times with!!!

My wish is that i wanna to live with my “Pretty girl” forever and also wanna to meet her as well,please god please, finish my this wish, no more aspire from me only once.

that people treat each other more nicely and with greater civility, in both their private and public lives.

that we would get our financial house in order

Year’s end is neither an end nor a beginning but a going on, with all the wisdom that experience can instill in us. Wish u very happy new year and take care

Wish you a very happy And prosperous new year 2011

Tom Cruise
Angelina Jolie
Aishwarya Rai
Arnold
Jennifer Lopez
Amitabh Bachhan
& me..
All the Stars wish u a Very Happy New Year.

Oh my Dear, Forget ur Fear,
Let all ur Dreams be Clear,
Never put Tear, Please Hear,
I want to tell one thing in ur Ear
Wishing u a very Happy “NEW YEAR”!

May The Year 2011 Bring for You…. Happiness,Success and filled with Peace,Hope n Togetherness of your Family n Friends….

i want to be happy

Good health for my family and friends

I wish my husband’s children would stop being so mean and violent and act like normal children. I want to love my husband just as much as before we got full custody.

to get wonderful loving girl for me.. :))

Keep some good try. Wish u happy new year

happy new year to all

My wish is to find a good job.

i wish i get a big outsourcing contract this year that i can re-set up my business and get back on track.

I wish that I be firm in whatever I do. That I can do justice to all my endeavors. That I give my 100%, my wholehearted efforts to each and every minutest work I do.

My wish for 2011, is a little patience and understanding for everyone, empathy always helps.

To be able to afford a new house

“NEW YEAR 2011″
+NEW AIM + NEW ACHIEVEMENT + NEW DREAM +NEW IDEA + NEW THINKING +NEW AMBITION =NEW LIFE+SUCCESS HAPPY NEW YEAR!

let this year be terrorist free world

Wish the world walk forward in time with all its innocence and beauty where prevails only love, and hatred no longer found in the dictionary.

no

Wish u a very happy New Year Friends and make this year as a pleasant days…

I wish the economy would get better, so people can afford to pay their bills and live more comfortably again.

i wish, god makes life beautiful and very simple to all of us. and happy new year to world.

Be always at war with your vices, at peace with your neighbors, and let each new year find you a better man and I wish a very very prosperous new year.

i wish i would buy a house and car for my mom

I wish to have a new car.
This new year will be full of expectation in the field of investment.We concerned about US dollar. Hope this year will be a good for US dollar.

this year is very enjoyment life

Cheers to a New Year and another chance for us to get it right

to get married

Wishing all a meaningful,purposeful,healthier and prosperous New Year 2011.

WISH YOU A HAPPY NEW YEAR 2011 MAY BRING ALL HAPPINESS TO YOU

RAKKIMUTHU

In 2011 I wish for my family to get in a better spot financially and world peace.

Wish that economic conditions improve to the extent that the whole spectrum of society can benefit and improve themselves.

I want my divorce to be final and for my children to be happy.

This 2011 year is very good year for All with Health & Wealth.

I wish that things for my family would get better. We have had a terrible year and I am wishing that we can look forward to a better and brighter 2011.

This year bring peace and prosperity to all. Everyone attain the greatest goal of life. May god gives us meaning of life to all.

This new year will bring happy in everyone’s life and peace among countries.

I hope for bipartisanship and for people to realize blowing up other people isn’t the best way to get their point across. It just makes everyone else angry.

A better economy would be nice too

I wish that in 2011 the government will work together as a TEAM for the betterment of all. Peace in the world.

i wish you all happy new year. may god bless all……

no i wish for you

I wish that my family will move into our own house and we can be successful in getting good jobs for our future.

I wish my girl comes back to me

Wish You Happy New Year for All, especially to the workers and requester’s of Mturk.

Greetings!!!

Wishing you and your family a very happy and prosperous NEW YEAR – 2011

May this New Year bring many opportunities your way, to explore every joy of life and may your resolutions for the days ahead stay firm, turning all your dreams into reality and all your efforts into great achievements.

Wish u a Happy and Prosperous New Year 2011….

Wishing u lots of happiness..Success..and Love

and Good Health…….

Wish you a very very happy new year

WISHING YOU ALL A VERY HAPPY & PROSPEROUS NEW YEAR…….

I wish in this 2011 is to be happy,have a good health and also my family.

I pray that the coming year should bring peace, happiness and good health.

I wish for my family to continue to be healthy, for my cars to continue running, and for no 10th Anniversary attacks this upcoming September.

be a good and help full for my family .

Happy and Prosperous New Year

New day new morning new hope new efforts new success and new feeling,a new year a new begening, but old friends are never forgotten, i think all who touched my life and made life meaningful with their support, i pray god to give u a verry “HAPPY AND SUCCESSFUL NEW YEAR”.

Be a good person,as good as no one

wish this new year brings cheers and happiness to one and all.

For the year 2011 I simply wish for the ability to support my family properly and have a healthier year.

I wish I have luck with getting a better job.

Greater awareness of climate change, and a recovering US economy.

this new year 2011 brings you all prosperous and happiness in your life…….

happy newyear wishes to all the beautiful hearts in the world in the world.god bless you all.

wishing every happy new year to all my pals and relatives and to all my lovely countrymen

XMPP untethered – serverless messaging in the core?

In the XMPP session at last february’s FOSDEM I gave a brief demo of some NoTube work on how TV-style remote controls might look with XMPP providing their communication link. For the TV part, I showed Boxee, with a tiny Python script exposing some of its localhost HTTP API to the wider network via XMPP. For the client, I have a ‘my first iphone app‘ approximation of a remote control that speaks a vapourware XMPP remote control protocol, “Buttons”.

The point of all this is about breaking open the Web-TV environment, so that different people and groups get to innovate without having to be colleagues or close-nit business partners. Control your Apple TV with your Google Android phone; or your Google TV with your Apple iPad, or your Boxee box with either. Write smart linking and bookmarking and annotation apps that improve TV for all viewers, rather than only those who’ve bought from the same company as you. I guess I managed to communicate something of this because people clapped generously when my iphone app managed to pause Boxee. This post is about how we might get from evocative but toy demos to a useful and usable protocol, and about one of our largest obstacles: XMPP’s focus on server-mediated communications.

So what happened when I hit the ‘pause’ button on the iphone remote app? Well, the app was already connected to the XMPP network, e.g. signed in as bob.notube@gmail.com via Google Talk’s servers. And so an XMPP stanza flowed out from the room we were in, across to Google somewhere, and then via XMPP server-to-server protocol over to my self-run XMPP server (an ejabberd hosted on Amazon EC2’s east USA zone somewhere). And from there, the message returned finally to Brussels, flowing through whichever Python library I was using to Boxee (signed in as buttons@foaf.tv), causing the video to pause. This happened quite quickly, and generally very quickly; but sometimes it can take more than a second. This can be very frustrating, and while there are workaround (keep-alive messages, smart code that ignores sequences of buffered ‘Pause!’ messages, apps that download metadata and bring more UI to the second screen, …), the problem has a simple cause: it just doesn’t make sense for a ‘pause’ message to cross the atlantic twice, and pass through two XMPP servers, on its the way across the living room from remote control to TV.

But first – why are we even using XMPP at all, rather than say HTTP? Partly because XMPP lets us easily address devices on home networks, that aren’t publically exposed as running a Web server. Partly for the symmetry of the protocol, since ipads, touch tables, smart phones, TVs and media centres all can host and play media items on their own displays, and we may have several such devices in a home setting that need to be in touch with one another. There’s also a certain lazyness; XMPP already defines lots of useful pieces, like buddylist rosters, pubsub notifications, group chats; it has an active and friendly community, and it comes with a healthy collection of tools and libraries. My own interests are around exploring and collectively annotating the huge archives of content that are slowly coming online, and an expectation that this could be a more shared experience, so I’m following an intuition that XMPP provides more useful ‘raw materials’ for social content exploration than raw HTTP. That said, many elements of remote control can be defined and implemented in either environment. But for today, I’m concentrating on the XMPP side.

So back at FOSDEM I raised a couple of concerns, as a long-term XMPP well-wisher but non-insider.

The first was that the technology presents itself as a daunting collection of extensions, each of which might or might not be supported in some toolkit. To this someone (likely Dave Cridland) responded with the reassuring observation that most of these could be implemented by 3rd party app developer simply reading/writing XMPP stanzas. And that in fact pretty much the only ‘core’ piece of XMPP that wasn’t treated as core in most toolkits was the serverless, point-to-point XEP-0174 ‘serverless messaging‘ mode. Everything else, the rest of us mortals could hack in application code. For serverless messaging we are left waiting and hoping for the toolkit maintainers to wire things in, as it generally requires fairly intimate knowledge of the relevant XMPP library.

My second point was in fact related: that if XMPP tools offered better support for serverless operation, then it would open up lots of interesting application options. That we certainly need it for the TV remotes use case to be a credible use of XMPP. Beyond TV remotes, there are obvious applications in the area of open, decentralised social networking. The recent buzz around things like StatusNet, GNU Social, Diaspora*, WebID, OneSocialWeb, alongside the old stuff like FOAF, shows serious interest in letting users take more decentralised control of their online social behaviour. Whether the two parties are in the same room on the same LAN, or halfway around the world from each other, XMPP and its huge collection of field-tested, code-supported extensions is relevant, even when those parties prefer to communicate directly rather than via servers.

With XMPP, app party developers have a well-defined framework into which they can drop ad-hoc stanzas of information; whether it’s a vCard or details of recently played music. This seems too useful a system to reserve solely for communications that are mediated by a server. And indeed, XMPP in theory is not tied to servers; the XEP-0174 spec tells us both how to do local-network bonjour-style discovery, and how to layer XMPP on top of any communication channel that allows XML stanzas to flow back and forth.

From the abstract,

This specification defines how to communicate over local or wide-area networks using the principles of zero-configuration networking for endpoint discovery and the syntax of XML streams and XMPP messaging for real-time communication. This method uses DNS-based Service Discovery and Multicast DNS to discover entities that support the protocol, including their IP addresses and preferred ports. Any two entities can then negotiate a serverless connection using XML streams in order to exchange XMPP message and IQ stanzas.

But somehow this remains a niche use of XMPP. Many of the toolkits have some support for it, perhaps as work-in-progress or a patch, but it remains somewhat ‘out there’ rather than core to the XMPP approach. I’d love to see this change in 2011. The 0174 spec combines a few themes; it talks a lot about discovery, motivated in part by trade-fair and conference type scenarios. When your Apple laptop finds people locally on some network to chat with by “Bonjour”, it’s doing more or less XEP-0174. For the TV remote scenario, I’m interested in having nodes from a normal XMPP network drop down and “re-discover” themselves in a hopefully-lower-latency point to point mode (within some LAN or across the Internet, or between NAT-protected home LANs). There are lots of scenarios when having a server in the loop isn’t needed, or adds cost and risk (latency, single point of failure, privacy concerns).

XEP-0174 continues,

6. Initiating an XML Stream
In order to exchange serverless messages, the initiator and
recipient MUST first establish XML streams between themselves,
as is familiar from RFC 3920.
First, the initiator opens a TCP connection at the IP address
and port discovered via the DNS lookup for an entity and opens
an XML stream to the recipient, which SHOULD include 'to' and
'from' address. [...]

This sounds pretty precise; point-to-point communication is over TCP.  The Security Considerations section discussed some of the different constraints for XMPP in serverless mode, and states that …

To secure communications between serverless entities, it is RECOMMENDED to negotiate the use of TLS and SASL for the XML stream as described in RFC 3920

Having stumbled across Datagram TLS (wikipedia, design writeup), I wonder whether that might also be an option for the layer providing the XML stream between entities.  For example, the chownat tool shows a UDP-based trick for establishing bidirectional communication between entities, even when they’re both behind NAT. I can’t help but wonder whether XMPP could be layered somehow on top of that (OpenSSL libraries have Datagram TLS support already, apparently). There are also other mechanisms I’ve been discussing with Mo McRoberts and Libby Miller lately, e.g. Mo’s dynamic dns / pubkeys idea, or his trick of running an XMPP server in the home, and opening it up via UPnP. But that’s for another time.

So back on my main theme: XMPP is holding itself back by always emphasising the server-mediated role. XEP-0174 has the feel of an afterthought rather than a core part of what the XMPP community offers to the wider technology scene, and the support for it in toolkits lags similarly. I’d love to hear from ‘live and breath XMPP’ folk what exactly they think is needed before it can become a more central part of the XMPP world.

From the TV remotes use case we have a few constraints, such as the need to associate identities established in different environments (eg. via public key). If xmpp:danbri-ipad@danbri.org is already on the server-based XMPP roster of xmpp:nevali-tv@nevali.net, can pubkey info in their XMPP vCards be used to help re-establish trusted communications when the devices find themselves connected in the same LAN? It seems just plain nuts to have a remote control communicate with another box in the same room via transatlantic links through Google Talk and Amazon EC2, and yet that’s the general pattern of normal XMPP communications. What would it take to have more out-of-the-box support for XEP-0174 from the XMPP toolkits? Some combination of beer, money, or a shared sense that this is worth doing and that XMPP has huge potential beyond the server-based communications model it grew from?

‘Republic of Letters’ in R / Custom Widgets for Second Screen TV navigation trails

As ever, I write one post that perhaps should’ve been two. This is about the use and linking of datasets that aid ‘second screen’ (smartphone, tablet) TV remotes, and it takes as a quick example a navigation widget and underlying dataset that show us how we might expect to navigate TV archives, in some future age when TV lives more fully in the World Wide Web. I argue that access to the ‘raw data‘ and frameworks for embedding visualisation apps are of equal importance when thinking about innovative ways of exploring the ever-growing archives. All of this comes from many discussions with my NoTube colleagues and other collaborators; rambling scribblyness is all my own.

Ben Hammersley points us at a lovely Flash visualization http://www.stanford.edu/group/toolingup/rplviz/”>Mapping the Republic of Letters”.

From the YouTube overview, “Researchers map thousands of letters exchanged in the 18th century’s “Republic of Letters” and learn at a glance what it once took a lifetime of study to comprehend.”


Mapping the Republic of Letters has at its center a multidimensional data set which spans 300 years and nearly 100,000 letters. We use computing tools that help us to measure and analyze data quantitatively, though that will not take us to our goal. While we use software and computing techniques that were designed for scientific and statistical methods, we are seeking to develop computing tools to enhance humanistic methods, to help us to explore qualitative aspects of the Republic of Letters. The subject of our study and the nature of the material require it. The collections of correspondence and records of travel from this period are incomplete. Of that incomplete material only a fraction has been digitized and is available to us. Making connections and resolving ambiguities in the data is something that can only be done with the help of computing, but cannot be done by computing alone. (from ‘methods and philosophy‘)


screenshot of Republic of Letters app, showing social network links superimposed on map of historical western Europe


See their detailed writeup for more on this fascinating and quite beautiful work. As I’m working lately on linking TV content more deeply into the Web, and on ‘second screen’ navigation, this struck me as just the kind of interface which it ought to be possible to re-use on a tablet PC to explore TV archives. Forgetting for the moment difficulties with Flash on iPads and so on, the idea roughly is that it would be great to embed such a visualization within a TV watching environment, such that when the ‘republic of letters’ widget is focussed on some person, place, or topic, we should have the opportunity to scan the available TV archives for related materials to show.

So a glance at Chrome’s ‘developer tools’ panel gave me a link to the underlying data used by the visualisation. I don’t know exactly whose it is, nor how they want it used, so please treat it with respect. Still, there it is, sat in the Web, in tab-separated format, begging to be used. There’s a lot you can do with the Flash application that I’ve barely touched, but I’m intrigued by the underlying dataset. In particular, where they have the string “Tonson, Jacob”, the data linker in me wants to see a Wikipedia or DBpedia link, since they provide explanation, context, related people, places and themes; all precious assets when trying to scrape together related TV materials to inform, educate or entertain someone with. From a few test searches, it turns out that (many? most?) the correspondents are quite easily matched to Wikipedia: William Congreve, Montagu, 1st earl of Halifax, CharlesHough, bishop of Worcester, John; Stanyan, Abraham;  … Voltaire and others. But what about the data?

Lately I’ve been learning just a little about R, a language used mainly for statistics and related analysis. Here’s what it’ll do ‘out of the box’, in untrained hands:

letters<-read.csv('data.txt',sep='\t', header=TRUE)
v_author = letters$Author=="Voltaire"
v_letters = letters[v_author, ]
Where were Voltaire’s letters sent?
> cbind(summary(v_letters$dest_country))
[,1]
Austria            2
Belgium            6
Canada             0
Denmark            0
England           26
France          1312
Germany           97
India              0
Ireland            0
Italy             68
Netherlands       22
Portugal           0
Russia             5
Scotland           0
Spain              1
Sweden             0
Switzerland      342
The Netherlands    1
Turkey             0
United States      0
Wales              0
As the overview and video in the ‘Republic of Letters‘ site points out (“Tracking 18th-century “social network” through letters”), the patterns of correspondence eg. between Voltaire and e.g. England, Scotland and Ireland jumps out of the data (and more so its visualisation). There are countless ways this information could be explored, presented, sliced-and-diced. Only a custom app can really make the most of it, and the Republic of Letters work goes a long way in that direction. They also note that
The requirements of our project are very much in sync with current work being done in the linked-data/ semantic web community and in the data visualization community, which is why collaboration with computer science has been critical to our project from the start.
So the raw data in the Web here is a simple table; while we could spend time arguing about whether it would better be expressed in JSON, XML or an RDF notation, I’d rather see some discussion around what we can do with this information. In particular, I’m intrigued by the possibilities of R alongside the data-linking habits that come with RDF. If anyone manages to tease anything interesting from this dataset, perhaps mixed in with DBpedia, do post your results.
And of course there are always other datasets to examine; for example see the Darwin correspondence archives, or the Open Knowledge Foundation’s Open Correspondence project which has a Dickens-based pilot. While it is wonderful having UI that is tuned to the particulars of some dataset, it is also great when we can re-use UI code to explore similarly structured data from elsewhere. On both the data side and the UI side, this is expensive, tough work to do well. My current concern is to maximise re-use of both UI and data for the particular circumstances of second-screen TV navigation, a scenario rarely a first priority for anyone!
My hope is that custom navigation widgets for this sort of data will be natural components of next-generation TV remote controls, and that TV archives (and other collections) will open up enough of their metadata to draw in (possibly paying) viewers. To achieve this, we need the raw data on both sides to be as connectable as possible, so that application authors can spend their time thinking about what their users really need and can use, rather than on whether they’ve got the ‘right’ Henry Newton.
If we get it right, there’s a central role for librarianship and archivists in curating the public, linked datasets that tell us about the people, places and topics that will allow us to make new navigation trails through Web-connected television, literature and encyclopedia content. And we’ll also see new roles for custom visualizations, once we figure out an embedding framework for TV widgets that lets them communicate with a display system, with other users in the same room or community, and that is designed for cross-referencing datasets that talk about the same entities, topics, places etc.
As I mentioned regarding Lonclass and UDC, collaboration around open shared data often takes place in a furtive atmosphere of guilt and uncertainty. Is it OK to point to the underlying data behind a fantastic visualisation? How can we make sure the hard work that goes into that data curation is acknowledged and rewarded, even while its results flow more freely around the Web, and end up in places (your TV remote!) that may never have been anticipated?

Subject classification and Statistics

Subject classification and statistics share some common problems. This post takes a small example discussed at this week’s ODaF event on “Semantic Statistics” in Tilberg, and explores its expression coded in the Universal Decimal Classification (UDC). UDC supports faceted description, providing an abstract grammar allowing sentence-like subject descriptions to be composed from the “raw materials” defined in its vocabulary scheme.

This makes the mapping of UDC (and to some extent also Dewey classifications)  into W3C’s SKOS somewhat lossy, since patterns and conventions for documenting these complex, composed structures are not yet well established. In the NoTube project we are looking into this in a TV context, in large part because the BBC archives make extensive use of UDC via their Lonclass scheme; see my ‘investigating Lonclass‘ and UDC seminar talk for more on those scenarios. Until this week I hadn’t thought enough about the potential for using this to link deep into statistical datasets.

One of the examples discussed on Tuesday was as follows (via Richard Cyganiak):

“There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008″

There was much interesting discussion in Tilburg about the proper scope and role of Linked Data techniques for sharing this kind of statistical data. Do we use RDF essentially as metadata, to find ‘black boxes’ full of stats, or do we use RDF to try to capture something of what the statistics are telling us about the world? When do we use RDF as simple factual data directly about the world (eg. school X has N pupils [currently; or at time t]), and when does it become a carrier for raw numeric data whose meaning is not so directly expressed at the factual level?

The state of the art in applying RDF here seems to be SDMX-RDF, see Richard’s slides. The SDMX-RDF work uses SKOS to capture code lists, to describe cross-domain concepts and to indicate subject matter.

Given all this, I thought it would be worth taking this tiny example and looking at how it might look in UDC, both as an example of the ‘compositional semantics’ some of us hope to capture in extended SKOS descriptions, but also to explore scenarios that cross-link numeric data with the bibliographic materials that can be found via library classification techniques such as UDC. So I asked the ever-helpful Aida Slavic (editor in chief of the UDC), who talked me through how this example data item looks from a UDC perspective.

I asked,

So I’ve just got home from a meeting on semweb/stats. These folk encode data values with stuff like “There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008″. How much of that could have a UDC coding? I guess I should ask, how would subject index a book whose main topic was “occupational injuries in the Washington DC metro area in 2008″?

Aida’s reply (posted with permission):

You can present all of it & much more using UDC. When you encode a subject like this in UDC you store much more information than your proposed sentence actually contains. So my decision of how to ‘translate this into udc’ would depend on learning more about the actual text and the context of the message it conveys, implied audience/purpose, the field of expertise for which the information in the document may be relevant etc. I would probably wonder whether this is a research report, study, news article, textbook, radio broadcast?

Not knowing more then you said I can play with the following: 331.46(735.215.2/.4)”2008

Accidents at work — Washington metropolitan area — year 2008
or a bit more detailed:  331.46-053.18(735.215.2/.4)”2008
Accidents at work — dead persons – Washington metropolitan area — year 2008
[you can say the number of dead persons but this is not pertinent from point of view of indexing and retrieval]

…or maybe (depending what is in the content and what is the main message of the text) and because you used the expression ‘fatal injuries’ this may imply that this is more health and safety/ prevention area in health hygiene which is in medicine.

The UDC structures composed here are:

TIME “2008”

PLACE (735.215.2/.4)  Counties in the Washington metropolitan area

TOPIC 1
331     Labour. Employment. Work. Labour economics. Organization of  labour
331.4     Working environment. Workplace design. Occupational safety.  Hygiene at work. Accidents at work
331.46  Accidents at work ==> 614.8

TOPIC 2
614   Prophylaxis. Public health measures. Preventive treatment
614.8    Accidents. Risks. Hazards. Accident prevention. Persona protection. Safety
614.8.069    Fatal accidents

NB – classification provides a bit more context and is more precise than words when it comes to presenting content i.e. if the content is focused on health and safety regulation and occupation health then the choice of numbers and their order would be different e.g. 614.8.069:331.46-053.18 [relationship between] health & safety policies in prevention of fatal injuries and accidents at work.

So when you read  UDC number 331.46 you do not see only e.g. ‘accidents at work’ but  ==>  ‘accidents at work < occupational health/safety < labour economics, labour organization < economy
and when you see UDC number 614.8  it is not only fatal accidents but rather ==> ‘fatal accidents < accident prevention, safety, hazards < Public health and hygiene. Accident prevention

When you see (735.2….) you do not only see Washington but also United States, North America

So why is this interesting? A couple of reasons…

1. Each of these complex codes combines several different hierarchically organized components; just as they can be used to explore bibliographic materials, similar approaches might be of value for navigating the growing collections of public statistical data. If SKOS is to be extended / improved to better support subject classification structures, we should take care also to consider use cases from the world of statistics and numeric data sharing.

2. Multilingual aspects. There are plans to expose SKOS data for the upper levels of UDC. An HTML interface to this “UDC summary” is already available online, and includes collected translations of textual labels in many languages (see progress report) . For example, we can look up 331.4 and find (in hierarchical context) definitions in English (“Working environment. Workplace design. Occupational safety. Hygiene at work. Accidents at work”), alongside e.g. Spanish (“Entorno del trabajo. Diseño del lugar de trabajo. Seguridad laboral. Higiene laboral. Accidentes de trabajo”), CroatianArmenian, …

Linked Data is about sharing work; if someone else has gone to the trouble of making such translations, it is probably worth exploring ways of re-using them. Numeric data is (in theory) linguistically neutral; this should make linking to translations particularly attractive. Much of the work around RDF and stats is about providing sufficient context to the raw values to help us understand what is really meant by “66” in some particular dataset. By exploiting SDMX-RDF’s use of SKOS, it should be possible to go further and to link out to the wider literature on workplace fatalities. This kind of topical linking should work in both directions: exploring out from numeric data to related research, debate and findings, but also coming in and finding relevant datasets that are cross-referenced from books, articles and working papers. W3C recently launched a Library Linked Data group, I look forward to learning more about how libraries are thinking about connecting numeric and non-numeric information.

Google Data APIs (and partial YouTube) supporting OAuth

Building on last month’s announcement of OAuth for the Google Contacts API, this from Wei on the oauth list:

Just want to let you know that we officially support OAuth for all Google Data APIs.

See blog post:

You’ll now be able to use standard OAuth libraries to write code that authenticates users to any of the Google Data APIs, such as Google Calendar Data API, Blogger Data API, Picasa Web Albums Data API, or Google Contacts Data API. This should reduce the amount of duplicate code that you need to write, and make it easier for you to write applications and tools that work with a variety of services from multiple providers. [...]

There’s also a footnote, “* OAuth also currently works for YouTube accounts that are linked to a Google Account when using the YouTube Data API.”

See the documentation for more details.

On the YouTube front, I have no idea what % of their accounts are linked to Google; lots I guess. Some interesting parts of the YouTube API: retrieve user profiles, access/edit contacts, find videos uploaded by a particular user or favourited by them plus of course per-video metadata (categories, keywords, tags, etc). There’s a lot you could do with this, in particular it should be possible to find out more about a user by looking at the metadata for the videos they favourite.

Evidence-based profiles are often better than those that are merely asserted, without being grounded in real activity. The list of people I actively exchange mail or IM with is more interesting to me than the list of people I’ve added on Facebook or Orkut; the same applies with profiles versus tag-harvesting. This is why the combination of last.fm’s knowledge of my music listening behaviour with the BBC’s categorisation of MusicBrainz artist IDs is more interesting than asking me to type my ‘favourite band’ into a box. Finding out which bands I’ve friended on MySpace would also be a nice piece of evidence to throw into that mix (and possible, since MusicBrainz also notes MySpace URIs).

So what do these profiles look like? The YouTube ‘retrieve a profile‘ API documentation has an example. It’s Atom-encoded, and beyond the video stuff mentioned above has fields like:

  <yt:age>33</yt:age>
  <yt:username>andyland74</yt:username>
  <yt:books>Catch-22</yt:books>
  <yt:gender>m</yt:gender>
  <yt:company>Google</yt:company>
  <yt:hobbies>Testing YouTube APIs</yt:hobbies>
  <yt:location>US</yt:location>
  <yt:movies>Aqua Teen Hungerforce</yt:movies>
  <yt:music>Elliott Smith</yt:music>
  <yt:occupation>Technical Writer</yt:occupation>
  <yt:school>University of North Carolina</yt:school>
  <media:thumbnail url='http://i.ytimg.com/vi/YFbSxcdOL-w/default.jpg'/>
  <yt:statistics viewCount='9' videoWatchCount='21' subscriberCount='1'
    lastWebAccess='2008-02-25T16:03:38.000-08:00'/>

Not a million miles away from the OpenSocial schema I was looking at yesterday, btw.

I haven’t yet found where it says what I can and can’t do with this information…

Obama Web jobs in Boston

How could this not be a fun way to spend 6 months?

Obama for America is looking for exceptionally talented web developers who want to play a key role in a historic political campaign and help elect Barack Obama as the next President of the United States.

This six-month opportunity will allow you to:

  • Create software tools which will enable an unprecedented nationwide voter contact and mobilization effort
  • Help build and run the largest online, grassroots fundraising operation in the history of American politics
  • Introduce cutting-edge social networking and online organizing to the democratic process by empowering everyday people to participate on My.BarackObama

They also have a security expert position open.

Successful candidates will join the development team in Boston, MA.

Almost makes me wish I was a US Citizen. Sorry ma’am

Obama

I was really very impressed by Obama’s speech this week. And somewhat suprised to hear a major US politician speak on complex, subtle issues in thoughtful, nuanced terms. For non USAmericans, I think it’s often hard to empathise with US-style patriotism; in particular, the seeming impossibility of seeing the US as anything other than a shining beacon of goodness, as the world’s policeman. Watching the warmongering, flag-waving television news in the US in 2001/2 terrified me, and left me feeling like an alien in a strange land. But this speech did give me a vivid sense of an America that one could admire, even aspire to live in, and one that was a lot more honest with us all about its failings and difficulties, as well as justifiably proud of its many strengths. I really think this was a landmark speech, and one that shows the US at its best.

I was reading around the various responses, and so here’s a quick link-dump. Daily show; The Onion; Andrew Sullivan; Juan Cole; Fox News.

And also Daily Kos quoting of all people Mick Huckabee:

And one other thing I think we’ve gotta remember. As easy as it is for those of us who are white, to look back and say “That’s a terrible statement!”…I grew up in a very segregated south. And I think that you have to cut some slack — and I’m gonna be probably the only Conservative in America who’s gonna say something like this, but I’m just tellin’ you — we’ve gotta cut some slack to people who grew up being called names, being told “you have to sit in the balcony when you go to the movie. You have to go to the back door to go into the restaurant. And you can’t sit out there with everyone else. There’s a separate waiting room in the doctor’s office. Here’s where you sit on the bus…” And you know what? Sometimes people do have a chip on their shoulder and resentment. And you have to just say, I probably would too. I probably would too. In fact, I may have had more of a chip on my shoulder had it been me.

Quite so. And Huckabee deserves praise for acknowledging this. A similar perspective would bring some rationality to foreign policy discussions too. Understandably, most of the commentary we’ve seen on this speech have been on US domestic politics. But one point that seems to have gone underemphasised in the commentary I’ve read (even beyond The Onion!) is that for all the folk inside the US sympathising with Wright’s “God Damn America” outburst, there are hundreds or thousands or more out here in the rest of the world who are frustrated, angry and outraged by the actions of successive US governments. Giving the world a US president who seems capable of acknowledging this and beginning to address it would be a breath of fresh air. Elect him already! :) (and not that guy who jokes about killing my friends, please…).