Archive.org TV metadata howto

The following is composed from answers kindly supplied by Hank Bromley, Karen Coyle, George Oates, and Alexis Rossi of the archive.org team. I have mixed together various helpful replies and retro-fitted them into a howto/FAQ-style summary.

I asked about APIs and data access for descriptions of the many and varied videos in Archive.org. This guide should help you get started with building things that use archive.org videos. Since the content up there is pretty much unencumbered, it is perfect for researchers looking for content to use in demos. Or something to watch in the evening.

To paraphrase their answer, it was roughly along these lines:

  • you can do automated lookups of the search engine using a simple HTTP/JSON API
  • downloading a lot or everything is ok if you need or prefer to work locally, but please write careful scripts
  • hopefully the search interface is useful enough that you can avoid needing to do this

Short API overview: each archive entry that is a movie, video or TV file should have the mediatype ‘movies’. Everything in the archive has a short textual ID, and an XML description at a predictable URL. You can find those by using the JSON flavour of the archive’s search engine, then download the XML (and the content itself) at your leisure. Please cache where possible!

I was also pointed to http://deweymusic.org/ which is an example of a site that provides a new front-end for archive.org audio content – their live music collection. My hope in posting these notes here is to help people working on new interfaces to Web-connected TV explore archive.org materials in their work.

JSON API to archive.org services

See online documentation for JSON interface; if you’re happy working with the remote search engine and are building a Javascript-based app, this is perfect.

We have been moving the majority of our services from formats like XML, OAI and others to the more modern JSON format and method of client/server interaction.

How to … play well with others

As we do not have unlimited resources behind our services, we ask that users cache results where they can, especially for high-traffic and popular installations/uses. 8-)

TV content in the archive

The archive contains a lot of video files: old movies, educational clips, all sorts of fun stuff. There is also some work on reflecting broadcast TV into the system:

First off, we do have some television content available on the site right now:
http://www.archive.org/details/tvarchive – It’s just a couple of SF gov channels, so the content itself is not terribly exciting.  But what IS cool is that this is being recorded directly off air and then thrown into publicly available items on archive.org automatically.  We’re recording other channels as well, but we currently aren’t sure what we can make public, and how.

See also televisionarchive.org and http://www.archive.org/details/sept_11_tv_archive

How to… get all metadata

If you would rather download all the metadata and put it in your own search engine or database, it’s simple to do: get a list of the identifiers of all video items from the search engine (mediatype:movies), and for each one, fetch this file:

http://www.archive.org/download/{itemID}/{itemID}_meta.xml

So it’s a bit of work, since you have to retrieve each metadata record separately, but it is easily scriptable.

However, once you have the identifier for an item, you can automatically find the meta.xml for it (or the files.xml if that’s what you want).  So if the item is at:
http://www.archive.org/details/Sita_Sings_the_Blues
the meta.xml is at
http://www.archive.org/download/Sita_Sings_the_Blues/Sita_Sings_the_Blues_meta.xml
and the files.xml is at
http://www.archive.org/download/Sita_Sings_the_Blues/Sita_Sings_the_Blues_files.xml

This is true for every single item in the archive.
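As a minimal sketch of the URL pattern above (the helper names here are my own, not part of any archive.org library; the fetch itself is commented out so you can decide how and when to hit the live site):

```ruby
require 'open-uri'

# Build the predictable metadata URL for an archive.org item,
# following the pattern described above.
def meta_xml_url(item_id)
  "http://www.archive.org/download/#{item_id}/#{item_id}_meta.xml"
end

# Same pattern for the file listing.
def files_xml_url(item_id)
  "http://www.archive.org/download/#{item_id}/#{item_id}_files.xml"
end

# To actually fetch (please cache the result rather than re-requesting!):
# xml = URI.open(meta_xml_url('Sita_Sings_the_Blues')).read
```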

How to… get a list of all IDs

Use http://www.archive.org/advancedsearch.php

Basically, you put in a query, choose the metadata you want returned, then choose the format you’d like it delivered in (rss, csv, json, etc.).

Downsides to this method – you can only get about 10,000 items at once (you might be able to push it to 20,000) before it crashes on you, and you can only get the metadata fields listed.
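A hedged sketch of paging through those results in Ruby (the query parameters match the advancedsearch.php URLs quoted elsewhere in these notes; `page_url` is my own helper name):

```ruby
require 'cgi'

# Build a paged advancedsearch.php query URL asking only for the
# identifier field, with JSON output.
def page_url(page, rows = 100)
  query = CGI.escape('mediatype:movies')
  "http://www.archive.org/advancedsearch.php?q=#{query}" \
    "&fl[]=identifier&rows=#{rows}&page=#{page}&output=json"
end

# To walk the whole set you'd fetch page 1, 2, 3, ... until a page
# comes back empty -- being polite about request rate, and keeping in
# mind the ~10,000-item ceiling mentioned above.
```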

How to… monitor updates with RSS?

Once you have a full dump, you can monitor incoming items via the RSS feed on this page:

http://www.archive.org/details/movies

Subtitles / closed captions

For the live TV collection, there should be extracted subtitles, though maybe I just found bad examples (e.g. http://www.archive.org/details/SFGTV2_20100909_003000).

Todo: more info here!

What does the Archive search engine index?

In general *everything* in the meta.xml files is indexed in the IA search engine, and accessible for scripted queries at http://www.archive.org/advancedsearch.php.

But it may be that the search engine will support whatever queries you want to make, without your having to copy all the metadata to your own site.

How many “movies” are in the database?

Currently there are 314,624 “movies” items in the search engine. All TV and video items are supposed to have “movies” as their mediatype, although there has been some leakage now and then.

Should I expect a valid XML file for each id?

e.g. “identifier”:“mosaic20031001” seemed problematic.
There are definitely items on the archive that have extremely minimally filled-out meta.xml files.

Response from a trouble report:

“I looked at a couple of your examples, i.e. http://www.archive.org/details/HomeElec,  and they do have a meta.xml file in our system… but it ONLY contains a mediatype (movies) and identifier and nothing else.  That seems to be making our site freak out.  There are at least 800 items in movies that do not have a title.  There might be other minimal metadata that is required for us to think it’s a real item, but my guess is that if you did a search like this one you’d see fewer of those errors:
http://www.archive.org/search.php?query=mediatype%3Amovies%20AND%20title%3A[*%20TO%20*]

The other error you might see is “The item is not available due to issues with the item’s content.”  This is an item that has been taken down but for some reason it did not get taken out of the SE – it’s not super common, but it does happen.
I don’t think we’ve done anything with autocomplete on the Archive search engine, although one can use wildcards to find all possible completions by doing a query.  For example, the query:

http://www.archive.org/advancedsearch.php?q=mediatype%3Avideo+AND+title%3Aopen*&fl[]=identifier&fl[]=title&rows=10&page=1&output=json&save=yes

will match all items whose titles contain any words that start with “open” – that sample result of ten items shows titles containing “open,” “opening,” and “opener.”

How can I autocomplete against archive.org metadata?

Not at the moment.

“I believe autocomplete *has* been explored with the search engine on our “Open Library” sister site, openlibrary.org.”

How can I find interesting and well organized areas of the video archive?

I assume you’re looking for collections with pretty regular metadata to work on?  These collections tend to be fairly filled out:
http://www.archive.org/details/prelinger
http://www.archive.org/details/academic_films
http://www.archive.org/details/computerchronicles


Streaming Apple Events over XMPP

I’ve just posted a script that will re-route the OSX Apple Remote event stream out across XMPP using the Switchboard Ruby library, streaming click-down and click-up events from the device out to any endpoint identified by a Jabber/XMPP JID (i.e. Jabber ID). In my case, I’m connecting to XMPP as the user xmpp:buttons@foaf.tv, who is buddies with xmpp:bob.notube@gmail.com, ie. they are on each other’s Jabber rosters already. Currently I simply send a textual message with the button-press code; a real app would probably use an XMPP IQ stanza instead, which is oriented more towards machines than human readers.

The nice thing about this setup is that I can log in on another laptop to Gmail, run the Javascript Gmail / Google Talk chat UI, and hear a ‘beep’ whenever the event/message arrives in my browser. This is handy for informally testing the lagginess of the connection, which in turn is critical when designing remote control protocols: how chatty should they be? How much smarts should go into the client? Which bit of the system really understands what the user is doing? Informally, the XMPP events seem pretty snappy, but I’d prefer to see some real statistics to understand what a UI might risk relying on.

What I’d like to do now is get a Strophe Javascript client running. This will attach to my Jabber server and allow these events to show up in HTML/Javascript apps…

Here’s sample output of the script (local copy but it looks the same remotely), in which I press and release quickly every button in turn:

Cornercase:osx danbri$ ./buttonhole_surfer.rb
starting event loop.

=> Switchboard started.
ButtonDownEvent: PLUS (0x1d)
ButtonUpEvent: PLUS (0x1d)
ButtonDownEvent: MINU (0x1e)
ButtonUpEvent: MINU (0x1e)
ButtonDownEvent: LEFT (0x17)
ButtonUpEvent: LEFT (0x17)
ButtonDownEvent: RIGH (0x16)
ButtonUpEvent: RIGH (0x16)
ButtonDownEvent: PLPZ (0x15)
ButtonUpEvent: PLPZ (0x15)
ButtonDownEvent: MENU (0x14)
ButtonUpEvent: MENU (0x14)
^C
Shutdown initiated.
Waiting for shutdown to complete.
Shutdown initiated.
Waiting for shutdown to complete.
Cornercase:osx danbri$

Mozilla Ubiquity

There are some interesting things going on at Mozilla Labs. Yesterday, Ubiquity was all over the mailing lists. You can think of it as “what the Humanized folks did next”, or as a commandline for the Web, or as a Webbier sibling to QuickSilver, the MacOSX utility. I prefer to think of it as the Mozilla add-on that distracted me all day. Ubiquity continues Mozilla’s exploration of the potential UI uses of its “awesome bar” (aka Location bar). Ubiquity is invoked on my Mac with alt-space, at which point it’ll enthusiastically try to autocomplete a verb-centric Webby task from whatever I type. It does this by consulting a pile of built-in and community-provided Javascript functions, which have access to the Web, your browser (hello, widget security fans)… and it also has access to UI, in terms of an overlaid preview window, as well as a context menu that can actually be genuinely contextual, ie. potentially sensitive to microformat and RDFa markup.

So it might help to think of Ubiquity as a cross between The Hobbit, GreaseMonkey, bookmarklets, and Mozilla’s earlier forms of packaged addon. Ok, well it’s not very Hobbit, I just wanted an excuse for this screen grab. But it is about natural language interfaces to complex Webby datasources and services.

The basic idea here is that commands (triggered by some keyword) can be published in the Web as links to simple Javascript files that can be single-click added (without need for browser restart) by anyone trusting enough to add the code to their browser. Social/trust layers to help people avoid bad addons are in the works too.

I spent yesterday playing. There are some rough edges, but this is fun stuff for sure. The emphasis is on verbs, hence on doing, rather than solely on lookups, query and data access. Coupled with the dependency on third party Javascript, this is going to need some serious security attention. But but but… it’s so much fun to use and develop for. Something will shake out security-wise. Even if Ubiquity commands are only shared amongst trusting power users who have signed each other’s PGP keys, I think it’ll still have an important niche.

What did I make? A kind of stalk-a-tron, FOAF lookup tool. It currently only consults Google’s Social Graph API, an experimental service built from all the public FOAF and XFN on the Web plus some logic to figure out which account pages are held by the same person. My current demo simply retrieves associated URLs and photos, and displays them overlaid on the current page. If you can’t get it working via the Ubiquity auto-subscribe feature, try adding it by pasting the raw Javascript into the command-editor screen. See also the ‘sindice-term’ lookup tool from Michael Hausenblas. It should be fun seeing how efforts like Bengee’s SPARQLScript work can be plugged in here, too.

OpenSocial schema extraction: via Javascript to RDF/OWL

OpenSocial’s API reference describes a number of classes (‘Person’, ‘Name’, ‘Email’, ‘Phone’, ‘Url’, ‘Organization’, ‘Address’, ‘Message’, ‘Activity’, ‘MediaItem’, …), each of which has various properties whose values are either strings, references to instances of other classes, or enumerations. I’d like to make them usable beyond the confines of OpenSocial, so I’m making an RDF/OWL version. OpenSocial’s schema is an attempt to provide an overarching model for much of present-day mainstream ‘social networking’ functionality, including dating, jobs etc. Such a broad effort is inevitably somewhat open-ended, and so may benefit from being linked to data from other complementary sources.

With a bit of help from the shindig-dev list, #opensocial IRC, and Kevin Brown and Kevin Marks, I’ve tracked down the source files used to represent OpenSocial’s data schemas: they’re in the opensocial-resources SVN repository on code.google.com. There is also a downstream copy in the Apache Shindig SVN repo (I’m not very clear on how versioning and evolution is managed between the two). They’re Javascript files, structured so that documentation can be generated via javadoc. The Shindig-PHP schema diagram I posted recently is a representation of this schema.

So – my RDF version. At the moment it is merely a list of classes and their properties (expressed via rdfs:domain), written using RDFa/HTML. I don’t yet define rdfs:range for any of these, nor handle the enumerated values (opensocial.Enum.Smoker, opensocial.Enum.Drinker, opensocial.Enum.Gender, opensocial.Enum.LookingFor, opensocial.Enum.Presence) that are defined in enum.js.

The code is all in the FOAF SVN, and accessible via “svn co http://svn.foaf-project.org/foaftown/opensocial/vocab/”. I’ve also taken the liberty of including a copy of the OpenSocial *.js files, and Mozilla’s Rhino Javascript interpreter js.jar in there too, for self-containedness.

The code in schemarama.js will simply generate an RDFa/XHTML page describing the schema. This can be checked using the W3C validator, or converted to RDF/XML with the pyRDFa service at W3C.

I’ve tested the output using the OwlSight/pellet service from Clark & Parsia, and with Protege 4. It’s basic but seems OK and a foundation to build from. Here’s a screenshot of the output loaded into Protege (which btw finds 10 classes and 99 properties).

An example view from protege, showing the class browser in one panel, and a few properties of Person in another.

OK so why might this be interesting?

  • Using OpenSocial-derived vocabulary for OpenSocial-exported data in other contexts
    • databases (queryable via SPARQL)
    • mixed with FOAF
    • mixed with Microformats
    • published directly in RDFa/HTML
  • Mapping OpenSocial terms to other contact and social network schemas

This suggests some goals for continued exploration:

It should be possible to use “OpenSocial markup” in an ordinary homepage or blog (HTML or XHTML), drawing on any of the descriptive concepts they define, through using RDFa’s markup notation. As Mark Birbeck pointed out recently, RDFa is an empty vessel – it does not define any descriptive vocabulary. Instead, the RDF toolset offers an environment in which vocabulary from multiple independent sources can be mixed and merged quite freely. The hard work of the OpenSocial team in analysing social network schemas and finding commonalities, or of the Microformats scene in defining simple building-block vocabularies … these can hopefully be combined within a single environment.

Imagemap magic

I’ve always found HTML imagemaps to be a curiously neglected technology. They seem somehow to evoke the Web of the mid-to-late 90s, to be terribly ‘1.0’. But there’s glue in the old horse yet…

A client-side HTML imagemap lets you associate links (and via Javascript, behaviour) with regions of an image. As such, they’re a form of image metadata that can have applications including image search, Web accessibility and social networking. They’re also a poor cousin to the Web’s new vector image format, SVG. This morning I dug out some old work on this (much of which from Max, Libby, Jim all of whom btw are currently working at Joost; as am I, albeit part-time).
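For readers who haven't looked at one since the 90s, a client-side imagemap is just this sort of markup (the coordinates and filenames here are illustrative, not from a real page):

```html
<img src="photo.jpg" alt="group photo" usemap="#people">
<map name="people">
  <!-- coords are x1,y1,x2,y2 pixel values; made up for this example -->
  <area shape="rect" coords="34,44,270,350"
        href="http://danbri.org/" alt="danbri">
</map>
```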

The first hurdle you hit when you want to play with HTML imagemaps is finding an editor that produces them. The fact that my blog post asking for MacOSX HTML imagemap editors is now top Google hit for “MacOSX HTML imagemap” pretty much says it all. Eventually I found (and paid for) one called YokMak that seems OK.

So the first experiment here, was to take a picture (of me) and make a simple HTML imagemap.

danbri being imagemapped

As a step towards treating this as re-usable metadata, here’s imagemap2svg.xslt from Max back in 2002. The results of running it with xsltproc are online: _output.svg (you need an SVG-happy browser). Firefox, Safari and Opera seem more or less happy with it (ie. they show the selected area against a pink background). This shows that imagemap data can be freed from the clutches of HTML, and repurposed. You can do similar things server-side using Apache Batik, a Java SVG toolkit. There are still a few 2002 examples floating around, showing how bits of the image can be described in RDF that includes imagemap info, and then manipulated using SVG tools driven from metadata.

Once we have this ability to pick out a region of an image (eg. photo) and tag it, it opens up a few fun directions. In the FOAF scene a few years ago, we had fun using RDF to tag image region parts with information about the things they depicted. But we didn’t really get into questions of surface syntax, ie. how to make rich claims about the image area directly within the HTML markup. These days, some combination of RDFa or microformats would probably be the thing to use (or perhaps GRDDL). I’ve sent mail to the RDFa group looking for help with this (see that message for various further related-work links too).

Specifically, I’d love to have some clean HTML markup that said, not just “this area of the photo is associated with the URI http://danbri.org/”, but “this area is the Person whose openid is danbri.org, … and this area depicts the thing that is the primary topic of http://en.wikipedia.org/wiki/Eiffel_Tower”. If we had this, I think we’d have some nice tools for finding images, for explaining images to people who can’t see them, and for connecting people and social networks through codepiction.

Codepiction

Ruby client for querying SPARQL REST services

I’ve started a Ruby conversion of Ivan Herman’s Python SPARQL client, itself inspired by Lee Feigenbaum’s Javascript library. These are tools which simply transmit a SPARQL query across the ‘net to a SPARQL-protocol database endpoint, and handle the unpacking of the results. These queries can result in yes/no responses, variable-to-value bindings (rather like in SQL), or in chunks of RDF data. The default resultset notation is a simple XML format; JSON results are also widely available.

All I’ve done so far, is to sit down with the core Python file from Ivan’s package, and slog through converting it brainlessly into rather similar Ruby. Take out “self” and “:” from method definitions, write “nil” instead of “None”, write “end” at the end of each method and class, use Ruby’s iteration idioms, and you’re most of the way there. Of course the fiddly detail is where we use external libraries: for URIs, network access, JSON and XML parsing. I’ve cobbled something together which could be the basis for reconstructing all the original functionality in Ivan’s code. Currently, it’s more proof of concept, but enough to be worth posting.
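For flavour, here's a minimal sketch of the JSON-resultset half of such a client (the method name is mine, not Ivan's; it assumes Ruby's bundled json library and the standard SPARQL JSON results layout):

```ruby
require 'json'

# Unpack a SPARQL JSON resultset (the W3C SPARQL Query Results JSON
# Format) into an array of variable-name => value hashes, one per row.
def bindings_from_json(text)
  data = JSON.parse(text)
  data['results']['bindings'].map do |row|
    row.each_with_object({}) { |(var, cell), h| h[var] = cell['value'] }
  end
end

# A tiny inline sample resultset, as an endpoint might return it:
sample = <<~JSON
  { "head": { "vars": ["name"] },
    "results": { "bindings": [
      { "name": { "type": "literal", "value": "Dan" } }
    ] } }
JSON

bindings_from_json(sample)  # => [{"name"=>"Dan"}]
```

The real client of course also has to send the query over HTTP and negotiate the result format; this only shows the unpacking step.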

Files are in SVN and browsable online; there’s also a bundled up download for the curious, brave or helpful. The test script shows all we have at the moment: ability to choose XML or JSON results, and access result set binding rows. Other result formats (notably RDF; there’s no RDF/XML parser currently) aren’t handled, and various bits of the code conversion are incomplete. I’d be super happy if someone came along and helped finish this!

Update: Ruby’s REXML parser is now clumsily wired in; you get a REXML Document object (see the xml.com writeup) as a way of navigating the resultset.

HTML Imagemap authoring tool for MacOSX?


I’ve searched around in vain for one. I want to annotate my FOAF spec diagram with mouseover text and links into the documentation. Most of the tools I find are a decade or more old, or pay-to-play. I remember the Gimp image editor can do imagemaps, but it crashes on startup on my MacBook. I did find a nice Javascript editor the other day, but it had no “undo” function, which made complex work impossible. Maybe I’ll try Amaya. Suggestions very much welcomed! Are imagemaps *that* uncool these days? They’re just SVG and image metadata in another notation… (and imho one of the more interesting scenarios for HTML-based microformattery). That last link has missing photos and out of date SVG, but might still be of interest. A surviving screenshot included here.

Nearby: (from SWAD-Europe hacking days) an imagemap2svg.xslt thanks to Max Froumentin.

Update: I installed Amaya 9.99 for Intel MacOSX. Sadly it couldn’t even display my JPEG properly, although it did do a better job showing OmniGraffle’s SVG output than Firefox.