This is a quick visual teaser for some archive.org-related work I’m doing with NoTube colleagues, and a collaboration with Kingsley Idehen on navigating it.
In NoTube we are trying to match people and TV content by using rich linked data representations of both. I love Archive.org and with their help have crawled an experimental subset of the video-related metadata for the Archive. I’ve also used a couple of other sources: Sean P. Aune’s list of 40 great movies, and the Wikipedia page listing US public domain films. I scraped, merged and fixed until I had a reasonable sample dataset for testing.
I wanted to test the Microsoft Pivot Viewer (a Silverlight control), and since OpenLink’s Virtuoso package now has built-in support, I got talking with Kingsley and we ended up with the following demo. Since not everyone has Silverlight, and this is just a rough prototype that may be offline, I’ve made a few screenshots. The real thing is very visual, with animated zooms and transitions, but screenshots give the basic idea.
Notes: the core dataset for now is just links between archive.org entries and Wikipedia/DBpedia pages. In NoTube we’ll also try the Lupedia, Zemanta and Reuters OpenCalais services on the Archive.org descriptions to see if they suggest other useful links and categories, as well as any other enrichment sources (delicious tags, machine learning) we can find. There is also more metadata from the Archive that we should be using.
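To make "one link per item" concrete, here is a minimal sketch (not the code behind the demo). It assumes each Archive item is paired with an English Wikipedia URL and relies on the DBpedia convention that a resource URI reuses the Wikipedia page title. The item identifier `example_item_id` and the function name `dbpedia_uri` are made up for illustration.

```python
from urllib.parse import urlparse


def dbpedia_uri(wikipedia_url: str) -> str:
    """Map an English Wikipedia article URL to the matching DBpedia resource URI."""
    parsed = urlparse(wikipedia_url)
    if parsed.netloc != "en.wikipedia.org" or not parsed.path.startswith("/wiki/"):
        raise ValueError(f"not an English Wikipedia article URL: {wikipedia_url}")
    title = parsed.path[len("/wiki/"):]
    return "http://dbpedia.org/resource/" + title


# One extra fact per Archive item: identifier -> Wikipedia page.
# The identifier is hypothetical; only the Wikipedia title is real.
links = {"example_item_id": "http://en.wikipedia.org/wiki/Plan_9_from_Outer_Space"}

# Derive the DBpedia side of each link.
enriched = {item: dbpedia_uri(url) for item, url in links.items()}
```

Everything else in the demo (director, runtime, release date) hangs off the DBpedia end of that single link.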
This preview simply shows how one extra fact per archived item creates new opportunities for navigation, discovery and understanding. Note that the UI is in no way tuned to be TV, video or archive specific; rather it just lets you explore a group of items by their ‘facets’ or common properties. It also reveals that wiki data is rather chaotic, although some fields (release date, runtime, director, star etc.) are reliably present. And of course, since the data is from Wikipedia, users can always fix the data.
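For readers who have not met faceted browsing before, the idea can be sketched in a few lines. This is only an illustration with placeholder records (the titles, directors and years are invented, and `facet_counts` is a hypothetical helper), not how Pivot or Virtuoso work internally.

```python
from collections import Counter, defaultdict

# Placeholder records; real wiki-derived data is patchier than this.
items = [
    {"title": "Film A", "director": "Director X", "release_year": 1941},
    {"title": "Film B", "director": "Director X", "release_year": 1947},
    {"title": "Film C", "director": "Director Y"},  # release_year missing
]


def facet_counts(items):
    """Count how often each value occurs for every property except the title."""
    counts = defaultdict(Counter)
    for item in items:
        for prop, value in item.items():
            if prop != "title":
                counts[prop][value] += 1
    return counts


facets = facet_counts(items)
# facets["director"] maps each director to the number of items sharing it;
# items lacking a property simply do not contribute to that facet.
```

Each facet is just a property with a tally of its values, which is why a generic UI can offer it without knowing anything about film or TV.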
You often hear Linked Data enthusiasts talk about data “silos”, and the need to interconnect them. All that means here is that when collections are linked, improvements to information on one side of the link automatically bring improvements to the other. When a Wikipedia page about a director, actor or movie is improved, it now also improves our means of navigating Archive.org’s wonderful collection. And when someone contributes new video or new HTML5-powered players to the Archive, they’re enriching the encyclopedia too.
One thing to mention is that everything here comes from the Wikipedia data that is automatically extracted by DBpedia, and that currently the extractors are not working perfectly on all films. So it should get better in the future. I also added a lot of the image links myself, semi-automatically. For now, this navigation is much more fact-based than topic-based; however we do have Wikipedia categories for each film, director, studio etc., and these have been mapped to other category systems (formal and informal), so there are a lot of other directions to explore.
What else can we do? How about flipping the tiled barchart to organize by the film’s distributor, and constraining the ‘release date’ facet to the 1940s:
That’s nice. But remember that with Linked Data, you’re always dealing with a subset of data. It’s hard to know (and it’s hard for the interface designers to show us) when you have all the relevant data in hand. In this case, we can see what this is telling us about the videos currently available within the demo. But does it tell us anything interesting about all the films in the Archive? All the films in the world? Maybe a little, but interpretation is difficult.
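The distributor-by-decade view described above boils down to a filter followed by a grouping. Here is a small sketch under the same caveat: the records are placeholders, and `in_decade` and `group_by` are illustrative helpers rather than anything from the demo.

```python
from collections import defaultdict

# Placeholder sample; a real crawl would be larger and messier.
films = [
    {"title": "Film A", "distributor": "Studio One", "release_year": 1941},
    {"title": "Film B", "distributor": "Studio One", "release_year": 1947},
    {"title": "Film C", "distributor": "Studio Two", "release_year": 1944},
    {"title": "Film D", "distributor": "Studio Two", "release_year": 1953},
    {"title": "Film E", "release_year": 1949},         # distributor unknown
    {"title": "Film F", "distributor": "Studio One"},  # release year unknown
]


def in_decade(film, start):
    """True if the film has a release year within the ten years from `start`."""
    year = film.get("release_year")
    return year is not None and start <= year < start + 10


def group_by(films, prop):
    """Group film titles by a property, keeping missing values visible."""
    groups = defaultdict(list)
    for film in films:
        groups[film.get(prop, "(unknown)")].append(film["title"])
    return dict(groups)


forties = [film for film in films if in_decade(film, 1940)]
by_distributor = group_by(forties, "distributor")
```

Note that "Film F" silently drops out of the 1940s view because its year is missing. That is the interpretation problem in miniature: the bars describe the sample we happen to hold, not the Archive as a whole.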
Next: zoom in to a specific item, the legendary Plan 9 from Outer Space (Wikipedia / DBpedia).
Note the HTML-based info panel on the right-hand side. In this case it’s automatically generated by Virtuoso from properties of the item. A TV-oriented version would be less generic.
Finally, we can explore the collection by constraining the timeline to show us items organized according to release date, for some facet. Here we show it picking out the career of one Edward J. Kay, at least as far as he shows up as composer of items in this collection:
Now turning back to Wikipedia to learn about ‘Edward J. Kay’, I find he has no entry (beyond these passing mentions of his name) in the English Wikipedia, despite his work on The Ape Man, The Fatal Hour, and other films. While the German Wikipedia does honour him with an entry, I wonder whether this kind of Linked Data navigation will change the dynamics of the ‘deletionism’ debates at Wikipedia. Firstly, it shows that structured data managed elsewhere can enrich Wikipedia (and vice versa), removing some pressure for a single wiki to cover everything. Secondly, it provides a tool to stand further back from the data and view things in a larger context; a context where, for example, Edward J. Kay’s achievements become clearer.
Much like Freebase Parallax, the Pivot viewer hints at a future in which we explore data by navigating from sets of things to other sets of things, like the set of films Edward J. Kay contributed to. Pivot doesn’t yet cover this, but it does very vividly present the potential for this kind of navigation, showing that navigation of films, TV shows and actors may be richer when it embraces more general mechanisms.
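As a rough sketch of what set-to-set navigation could look like over data of this shape (again with placeholder records and hypothetical helper names, and not a description of how Parallax or Pivot are built): start from one composer’s films, pivot to the set of their directors, then pivot again to everything those directors made.

```python
# Placeholder records; names are invented for illustration.
films = [
    {"title": "Film A", "composer": "Composer P", "director": "Director X"},
    {"title": "Film B", "composer": "Composer P", "director": "Director Y"},
    {"title": "Film C", "composer": "Composer Q", "director": "Director X"},
    {"title": "Film D", "composer": "Composer Q", "director": "Director Z"},
]


def select(films, prop, value):
    """The set of films (as a list of records) whose property equals value."""
    return [film for film in films if film.get(prop) == value]


def pivot(selection, prop):
    """Move from a set of films to the set of values they hold for a property."""
    return {film[prop] for film in selection if prop in film}


# Set of films -> set of people -> set of films.
scored_by_p = select(films, "composer", "Composer P")
directors = pivot(scored_by_p, "director")
their_films = [film["title"] for film in films if film.get("director") in directors]
```

Each step takes a whole set as input and yields another set, which is the general mechanism a film- or TV-specific interface would otherwise have to reinvent per relationship.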