Archive.org TV metadata howto

The following is composed from answers kindly supplied by Hank Bromley, Karen Coyle, George Oates, and Alexis Rossi from the archive.org team. I have mixed together various helpful replies and retrofitted them into a howto/FAQ-style summary.

I asked about APIs and data access for descriptions of the many and varied videos in Archive.org. This guide should help you get started with building things that use archive.org videos. Since the content up there is pretty much unencumbered, it is perfect for researchers looking for content to use in demos. Or something to watch in the evening.

To paraphrase, their answer was roughly along these lines:

  • you can do automated lookups of the search engine using a simple HTTP/JSON API
  • downloading a lot or everything is ok if you need or prefer to work locally, but please write careful scripts
  • hopefully the search interface is useful enough that you won’t need to do this

Short API overview: each archive entry that is a movie, video or TV file should have the mediatype ‘movies’. Everything in the archive has a short textual ID, and an XML description at a predictable URL. You can find those by using the JSON flavour of the archive’s search engine, then download the XML (and the content itself) at your leisure. Please cache where possible!

I was also pointed to http://deweymusic.org/ which is an example of a site that provides a new front-end for archive.org audio content – their live music collection. My hope in posting these notes here is to help people working on new interfaces to Web-connected TV explore archive.org materials in their work.

JSON API to archive.org services

See the online documentation for the JSON interface; if you’re happy working with the remote search engine and are building a JavaScript-based app, this is perfect.

We have been moving the majority of our services from formats like XML, OAI and others to the more modern JSON format and method of client/server interaction.
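As a hedged sketch (not official documentation), a minimal JSON query against the advanced search endpoint might look like this in Python, using the requests library; the response layout assumed here is the usual Solr-style one, with matching records under response.docs:

    import requests

    # Ask the archive.org search engine for movie items, with JSON output.
    # The endpoint and parameter names mirror the advancedsearch.php URLs
    # that appear later in this post.
    params = {
        "q": "mediatype:movies",
        "fl[]": ["identifier", "title"],  # metadata fields to return
        "rows": 10,                       # results per page
        "page": 1,
        "output": "json",
    }
    resp = requests.get("http://www.archive.org/advancedsearch.php", params=params)
    resp.raise_for_status()
    for doc in resp.json()["response"]["docs"]:
        print(doc["identifier"], doc.get("title", "(no title)"))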

How to… play well with others

As we do not have unlimited resources behind our services, we request that users try to cache results where they can, especially for higher-traffic and more popular installations/uses. 8-)
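A minimal sketch of how a client might honour that, using a throwaway on-disk cache (the cache directory and hashing scheme are just illustrative choices, not anything archive.org requires):

    import hashlib
    import os

    import requests

    CACHE_DIR = "ia_cache"  # illustrative local cache directory

    def cached_get(url):
        """Fetch a URL, reusing a locally cached copy if one already exists."""
        os.makedirs(CACHE_DIR, exist_ok=True)
        path = os.path.join(CACHE_DIR, hashlib.sha1(url.encode("utf-8")).hexdigest())
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read()
        body = requests.get(url).content
        with open(path, "wb") as f:
            f.write(body)
        return body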

TV content in the archive

The archive contains a lot of video files: old movies, educational clips, all sorts of fun stuff. There is also some work on reflecting broadcast TV into the system:

First off, we do have some television content available on the site right now:
http://www.archive.org/details/tvarchive – It’s just a couple of SF gov channels, so the content itself is not terribly exciting. But what IS cool is that this is being recorded directly off air and then automatically turned into publicly available items on archive.org. We’re recording other channels as well, but we currently aren’t sure what we can make public and how.

See also televisionarchive.org and http://www.archive.org/details/sept_11_tv_archive

How to… get all metadata

If you would really rather download all the metadata and put it in your own search engine or database, it’s simple to do: get a list of the identifiers of all video items from the search engine (mediatype:movies), and for each one, fetch this file:

http://www.archive.org/download/{itemID}/{itemID}_meta.xml

It’s a bit of work, since you have to retrieve each metadata record separately, but it is easy to script.
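A careful bulk-download loop along those lines might look like the following sketch (the identifier list is assumed to have come from the search engine as described below, and the one-second pause is just a polite guess rather than a stated archive.org requirement):

    import os
    import time

    import requests

    META_URL = "http://www.archive.org/download/{item_id}/{item_id}_meta.xml"

    def download_all_metadata(identifiers, out_dir="meta"):
        """Fetch the meta.xml for each item identifier and save it locally."""
        os.makedirs(out_dir, exist_ok=True)
        for item_id in identifiers:
            resp = requests.get(META_URL.format(item_id=item_id))
            if resp.status_code != 200:
                print("skipping", item_id, "- HTTP", resp.status_code)
                continue
            with open(os.path.join(out_dir, item_id + "_meta.xml"), "wb") as f:
                f.write(resp.content)
            time.sleep(1)  # be gentle with the servers (see the caching note above)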

However, once you have the identifier for an item, you can automatically find the meta.xml for it (or the files.xml if that’s what you want).  So if the item is at:
http://www.archive.org/details/Sita_Sings_the_Blues
the meta.xml is at
http://www.archive.org/download/Sita_Sings_the_Blues/Sita_Sings_the_Blues_meta.xml
and the files.xml is at
http://www.archive.org/download/Sita_Sings_the_Blues/Sita_Sings_the_Blues_files.xml

This is true for every single item in the archive.
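Reading one of those meta.xml files needs nothing beyond the standard library, since (at least in the items I have looked at) it is a flat list of field elements. A minimal sketch using the Sita_Sings_the_Blues example above:

    import requests
    import xml.etree.ElementTree as ET

    url = ("http://www.archive.org/download/Sita_Sings_the_Blues/"
           "Sita_Sings_the_Blues_meta.xml")
    root = ET.fromstring(requests.get(url).content)

    # meta.xml appears to be a flat <metadata> element containing one child
    # element per metadata field (identifier, title, description, ...).
    for field in root:
        print(field.tag, ":", (field.text or "").strip())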

How to… get a list of all IDs

Use http://www.archive.org/advancedsearch.php

Basically, you put in a query, choose the metadata you want returned, then choose the format you’d like it delivered in (rss, csv, json, etc.).

Downsides to this method – you can only get about 10,000 items at once (you might be able to push it to 20,000) before it crashes on you, and you can only get the metadata fields listed.
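With that limit in mind, a paginated harvest of identifiers might be sketched like this (the 1,000-rows-per-page figure is an arbitrary choice, and the 10,000-item cap echoes the limit mentioned above):

    import requests

    SEARCH_URL = "http://www.archive.org/advancedsearch.php"

    def fetch_identifiers(query="mediatype:movies", rows=1000, max_items=10000):
        """Page through the search results and collect item identifiers."""
        identifiers = []
        page = 1
        while len(identifiers) < max_items:
            params = {
                "q": query,
                "fl[]": "identifier",
                "rows": rows,
                "page": page,
                "output": "json",
            }
            docs = requests.get(SEARCH_URL, params=params).json()["response"]["docs"]
            if not docs:
                break
            identifiers.extend(doc["identifier"] for doc in docs)
            page += 1
        return identifiers[:max_items]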

How to… monitor updates with RSS?

Once you have a full dump, you can monitor incoming items via the RSS feed on this page:

http://www.archive.org/details/movies
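A sketch of how that might be polled, assuming the feedparser library; the exact feed URL is whatever the collection page advertises, so this snippet discovers it from the page rather than hard-coding it:

    import re

    import feedparser
    import requests

    page = requests.get("http://www.archive.org/details/movies").text

    # Find <link> tags advertising an RSS feed and pull out their href attribute.
    # (A real HTML parser would be more robust than this quick regex.)
    feed_urls = [
        re.search(r'href="([^"]+)"', tag).group(1)
        for tag in re.findall(r"<link[^>]+>", page)
        if "application/rss+xml" in tag and 'href="' in tag
    ]

    if feed_urls:
        for entry in feedparser.parse(feed_urls[0]).entries:
            print(entry.title, entry.link)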

Subtitles / closed captions

For the live TV collection, there should be extracted subtitles, though maybe I just found bad examples (e.g. http://www.archive.org/details/SFGTV2_20100909_003000).

Todo: more info here!

What does the Archive search engine index?

In general *everything* in the meta.xml files is indexed in the IA search engine, and accessible for scripted queries at http://www.archive.org/advancedsearch.php.

But it may be that the search engine will support whatever queries you want to make, without your having to copy all the metadata to your own site.

How many “movies” are in the database?

There are currently 314,624 “movies” items in the search engine. All TV and video items are supposed to have “movies” as their mediatype, although there has been some leakage now and then.

Should I expect a valid XML file for each ID?

e.g. “identifier”:“mosaic20031001” seemed problematic.
There are definitely items in the archive that have extremely minimally filled-out meta.xml files.

Response from a trouble report:

“I looked at a couple of your examples, i.e. http://www.archive.org/details/HomeElec,  and they do have a meta.xml file in our system… but it ONLY contains a mediatype (movies) and identifier and nothing else.  That seems to be making our site freak out.  There are at least 800 items in movies that do not have a title.  There might be other minimal metadata that is required for us to think it’s a real item, but my guess is that if you did a search like this one you’d see fewer of those errors:
http://www.archive.org/search.php?query=mediatype%3Amovies%20AND%20title%3A[*%20TO%20*]

The other error you might see is “The item is not available due to issues with the item’s content.”  This is an item that has been taken down but for some reason it did not get taken out of the SE – it’s not super common, but it does happen.
I don’t think we’ve done anything with autocomplete on the Archive search engine, although one can use wildcards to find all possible completions by doing a query.  For example, the query:

http://www.archive.org/advancedsearch.php?q=mediatype%3Avideo+AND+title%3Aopen*&fl[]=identifier&fl[]=title&rows=10&page=1&output=json&save=yes

will match all items whose titles contain any words that start with “open” – that sample result of ten items shows titles containing “open,” “opening,” and “opener.”
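That same wildcard trick can be scripted into a poor-man’s autocomplete; a sketch (reusing the query from the URL above and simply harvesting matching words from the returned titles):

    import re

    import requests

    def suggest(prefix, rows=10):
        """Return title words starting with `prefix`, via a wildcard title search."""
        params = {
            "q": "mediatype:video AND title:{0}*".format(prefix),
            "fl[]": "title",
            "rows": rows,
            "page": 1,
            "output": "json",
        }
        docs = requests.get("http://www.archive.org/advancedsearch.php",
                            params=params).json()["response"]["docs"]
        words = set()
        for doc in docs:
            for word in re.findall(r"\w+", str(doc.get("title", ""))):
                if word.lower().startswith(prefix.lower()):
                    words.add(word.lower())
        return sorted(words)

    print(suggest("open"))  # e.g. ['open', 'opener', 'opening', ...]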

How can I autocomplete against archive.org metadata?

Not at the moment.

“I believe autocomplete *has* been explored with the search engine on our “Open Library” sister site, openlibrary.org.”

How can I find interesting and well organized areas of the video archive?

I assume you’re looking for collections with pretty regular metadata to work on?  These collections tend to be fairly filled out:
http://www.archive.org/details/prelinger
http://www.archive.org/details/academic_films
http://www.archive.org/details/computerchronicles


Google Social Graph API, privacy and the public record

I’m digesting some of the reactions to Google’s recently announced Social Graph API. ReadWriteWeb asks whether this is a creeping privacy violation, and danah boyd has a thoughtful post raising concerns about whether the privileged tech elite have any right to experiment in this way with the online lives of those who lack status and knowledge of these obscure technologies, and who may be amongst the more vulnerable users of the social Web.

While I tend to agree with Tim O’Reilly that privacy by obscurity is dead, I’m not of the “privacy is dead, get over it” school of thought. Tim argues,

The counter-argument is that all this data is available anyway, and that by making it more visible, we raise people’s awareness and ultimately their behavior. I’m in the latter camp. It’s a lot like the evolutionary value of pain. Search creates feedback loops that allow us to learn from and modify our behavior. A false sense of security helps bad actors more than tools that make information more visible.

There’s a danger here of technologists seeming to blame the very people we’re causing pain to. As danah says, “Think about whistle blowers, women or queer folk in repressive societies, journalists, etc.”. Not everyone knows their DTD from their TCP, or understands anything of how search engines, HTML or hyperlinks work. And many folk have more urgent things to focus on than learning such obscurities, let alone understanding the practical privacy, safety and reputation-related implications of their technology-mediated deeds.

Web technologists have responsibilities to the users of the Web, and while media education and literacy are important, those who are shaping and re-shaping the Web ought to be spending serious time on a daily basis struggling to come up with better ways of allowing humans to act and interact online without other parties snooping. The end of privacy by obscurity should not mean the death of privacy.

Privacy is not dead, and we will not get over it.

But it does need to be understood in the context of the public record. The reason I am enthusiastic about the Google work is that it shines a big bright light on the things people are currently putting into the public record. And it does so in a way that should allow people to build better online environments for those who do want their public actions visible, while providing immediate – and sometimes painful – feedback to those who have over-exposed themselves in the Web, and wish to backpedal.

I hope Google can put a user support mechanism on this. I know from our experience in the FOAF community that, even with small-scale and obscure aggregators, people will find themselves and demand to be “taken down”. While any particular aggregator can remove or hide such data, unless the data is tracked back to its source, it’ll crop up elsewhere in the Web.

I think the argument that FOAF and XFN are particularly special here is a big mistake. Web technologies used correctly (posh – “plain old semantic html” in microformats-speak) already facilitate such techniques. And Google is far from the only search engine in existence. Short of obfuscating all text inside images, personal data from these sites is readily harvestable.

ReadWriteWeb comment:

None the less, apparently the absence of XFN/FOAF data in your social network is no assurance that it won’t be pulled into the new Google API, either. The Google API page says “we currently index the public Web for XHTML Friends Network (XFN), Friend of a Friend (FOAF) markup and other publicly declared connections.” In other words, it’s not opt-in by even publishers – they aren’t required to make their information available in marked-up code.

The Web itself is built from marked-up code, and this is a thing of huge benefit to humanity. Both microformats and the Semantic Web community share the perspective that the Web’s core technologies (HTML, XHTML, XML, URIs) are properly consumed both by machines and by humans, and that any efforts to create documents that are usable only by (certain fortunate) humans is anti-social and discriminatory.

The Web Accessibility movement have worked incredibly hard over many years to encourage Web designers to create well marked up pages, where the meaning of the content is as mechanically evident as possible. The more evident the meaning of a document, the easier it is to repurpose it or present it through alternate means. This goal of device-independent, well marked up Web content is one that unites the accessibility, Mobile Web, Web 2.0, microformat and Semantic Web efforts. Perhaps the most obvious case is for blind and partially sighted users, but good markup can also benefit those with the inability to use a mouse or keyboard. Beyond accessibility, many millions of Web users (many poor, and in poor countries) will have access to the Web only via mobile phones. My former employer W3C has just published a draft document, “Experiences Shared by People with Disabilities and by People Using Mobile Devices”. Last month in Bangalore, W3C held a Workshop on the Mobile Web in Developing Countries (see executive summary).

I read both Tim’s post, and danah’s post, and I agree with large parts of what they’re both saying. But not quite with either of them, so all I can think to do is spell out some of my perhaps previously unarticulated assumptions.

  • There is no huge difference in principle between “normal” HTML Web pages and XFN or FOAF. Textual markup is what the Web is built from.
  • FOAF and XFN take some of the guesswork out of interpreting markup. But other technologies (JavaScript, Perl, XSLT/GRDDL) can also transform vague markup into more machine-friendly markup. FOAF/XFN simply make this process easier, less heuristic and less error-prone.
  • Google was not the first search engine, it is not the only search engine, and it will not be the last search engine. To obsess on Google’s behaviour here is to mistake Google for the Web.
  • Deeds that are on the public record in the Web may come to light months or years later; Google’s opening up of the (already public, but fragmented) Usenet historical record is a good example here.
  • Arguing against good markup practice on the Web (accessible, device independent markup) is something that may hurt underprivileged users (with disabilities, or limited access via mobile, high bandwidth costs etc).
  • Good markup allows content to be automatically summarised and re-presented to suit a variety of means of interaction and navigation (eg. voice browsers, screen readers, small screens, non-mouse navigation etc).
  • Good markup also makes it possible for search engines, crawlers and aggregators to offer richer services.

The difference between Google crawling FOAF/XFN from LiveJournal, versus extracting similar information via custom scripts from MySpace, is interesting and important solely to geeks. Mainstream users have no idea of such distinctions. When LiveJournal originally launched their FOAF files in 2004, the rule they followed was a pretty sensible one: if the information was there in the HTML pages, they’d also expose it in FOAF.

We need to be careful of taking a ruthless “you can’t make an omelette without breaking eggs” line here. Whatever we do, people will suffer. If the Web is made inaccessible, with information hidden inside image files or otherwise obfuscated, we exclude a huge constituency of users. If we shine a light on the public record, as Google have done, we’ll embarrass, expose and even potentially risk harm to the people described by these interlinked documents. And if we stick our head in the sand and pretend that these folk aren’t exposed, I predict this will come back to bite us in the butt in a few months or years, since all that data is out there, being crawled, indexed and analysed by parties other than Google. Parties with less to lose, and more to gain.

So what to do? I think several activities need to happen in parallel:

  • Best practice codes for those who expose, and those who aggregate, social Web data
  • Improved media literacy education for those who are unwittingly exposing too much of themselves online
  • Technology development around decentralised, non-public record communication and community tools (eg. via Jabber/XMPP)

Any search engine at all, today, is capable of supporting the following bit of mischief:

Take as a starting point a collection of user profiles on a public site. Extract all the usernames. Find the ones that appear on the Web fewer than, say, 10,000 times, and on other sites. Assume these are unique user IDs and crawl the pages they appear in, do some heuristic name matching, … and you’ll have a pile of smushed identities, perhaps linking professional and dating sites, or drunken college photos to a respectable new life. No FOAF needed.

The answer I think isn’t to beat up on the aggregators, it’s to improve the Web experience such that people can have real privacy when they need it, rather than the misleading illusion of privacy. This isn’t going to be easy, but I don’t see a credible alternative.