‘Republic of Letters’ in R / Custom Widgets for Second Screen TV navigation trails

As ever, I write one post that perhaps should’ve been two. This is about the use and linking of datasets that aid ‘second screen’ (smartphone, tablet) TV remotes, and it takes as a quick example a navigation widget and underlying dataset that show us how we might expect to navigate TV archives, in some future age when TV lives more fully in the World Wide Web. I argue that access to the ‘raw data’ and frameworks for embedding visualisation apps are of equal importance when thinking about innovative ways of exploring the ever-growing archives. All of this comes from many discussions with my NoTube colleagues and other collaborators; rambling scribblyness is all my own.

Ben Hammersley points us at a lovely Flash visualization, “Mapping the Republic of Letters” (http://www.stanford.edu/group/toolingup/rplviz/).

From the YouTube overview, “Researchers map thousands of letters exchanged in the 18th century’s “Republic of Letters” and learn at a glance what it once took a lifetime of study to comprehend.”


Mapping the Republic of Letters has at its center a multidimensional data set which spans 300 years and nearly 100,000 letters. We use computing tools that help us to measure and analyze data quantitatively, though that will not take us to our goal. While we use software and computing techniques that were designed for scientific and statistical methods, we are seeking to develop computing tools to enhance humanistic methods, to help us to explore qualitative aspects of the Republic of Letters. The subject of our study and the nature of the material require it. The collections of correspondence and records of travel from this period are incomplete. Of that incomplete material only a fraction has been digitized and is available to us. Making connections and resolving ambiguities in the data is something that can only be done with the help of computing, but cannot be done by computing alone. (from ‘methods and philosophy‘)


[Image: screenshot of the Republic of Letters app, showing social network links superimposed on a map of historical western Europe]


See their detailed writeup for more on this fascinating and quite beautiful work. As I’m working lately on linking TV content more deeply into the Web, and on ‘second screen’ navigation, this struck me as just the kind of interface which it ought to be possible to re-use on a tablet PC to explore TV archives. Forgetting for the moment difficulties with Flash on iPads and so on, the idea roughly is that it would be great to embed such a visualization within a TV watching environment, such that when the ‘republic of letters’ widget is focussed on some person, place, or topic, we should have the opportunity to scan the available TV archives for related materials to show.

So a glance at Chrome’s ‘developer tools’ panel gave me a link to the underlying data used by the visualisation. I don’t know exactly whose it is, nor how they want it used, so please treat it with respect. Still, there it is, sat in the Web, in tab-separated format, begging to be used. There’s a lot you can do with the Flash application that I’ve barely touched, but I’m intrigued by the underlying dataset. In particular, where they have the string “Tonson, Jacob”, the data linker in me wants to see a Wikipedia or DBpedia link, since they provide explanation, context, related people, places and themes; all precious assets when trying to scrape together related TV materials to inform, educate or entertain someone with. From a few test searches, it turns out that (many? most?) of the correspondents are quite easily matched to Wikipedia: William Congreve; Montagu, 1st earl of Halifax, Charles; Hough, bishop of Worcester, John; Stanyan, Abraham; … Voltaire and others. But what about the data?

Lately I’ve been learning just a little about R, a language used mainly for statistics and related analysis. Here’s what it’ll do ‘out of the box’, in untrained hands:

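# Read the tab-separated file; stringsAsFactors=TRUE so summary() counts each country
letters <- read.csv('data.txt', sep='\t', header=TRUE, stringsAsFactors=TRUE)
# Keep only the rows whose Author column is exactly "Voltaire"
v_author <- letters$Author == "Voltaire"
v_letters <- letters[v_author, ]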
Where were Voltaire’s letters sent?
> cbind(summary(v_letters$dest_country))
[,1]
Austria            2
Belgium            6
Canada             0
Denmark            0
England           26
France          1312
Germany           97
India              0
Ireland            0
Italy             68
Netherlands       22
Portugal           0
Russia             5
Scotland           0
Spain              1
Sweden             0
Switzerland      342
The Netherlands    1
Turkey             0
United States      0
Wales              0
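
For comparison, here is a minimal Python sketch of the same tally. It relies only on the two column names used above (Author and dest_country). The sample rows are invented placeholders so that the snippet runs on its own; they are not taken from the real file. Unlike R’s factor summary, it lists only the countries that actually occur for the chosen author.

```python
import csv
import io
from collections import Counter

# Invented placeholder rows; the real tab-separated file has more columns and rows.
SAMPLE = (
    "Author\tdest_country\n"
    "Voltaire\tFrance\n"
    "Voltaire\tSwitzerland\n"
    "Voltaire\tFrance\n"
    "Congreve, William\tEngland\n"
)


def destinations(tsv_file, author):
    """Count letters per destination country for one author."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row["dest_country"] for row in reader if row["Author"] == author)


counts = destinations(io.StringIO(SAMPLE), "Voltaire")
for country, n in sorted(counts.items()):
    print(f"{country:<15}{n:>5}")
```

To try it on the downloaded file, pass an open file object instead of the StringIO wrapper, e.g. open('data.txt', newline=''); the file’s encoding is something you would need to check.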
As the overview and video on the ‘Republic of Letters’ site point out (“Tracking 18th-century ‘social network’ through letters”), patterns of correspondence, e.g. between Voltaire and England, Scotland and Ireland, jump out of the data (and more so out of its visualisation). There are countless ways this information could be explored, presented, sliced-and-diced. Only a custom app can really make the most of it, and the Republic of Letters work goes a long way in that direction. They also note that:
The requirements of our project are very much in sync with current work being done in the linked-data/ semantic web community and in the data visualization community, which is why collaboration with computer science has been critical to our project from the start.
So the raw data in the Web here is a simple table; while we could spend time arguing about whether it would better be expressed in JSON, XML or an RDF notation, I’d rather see some discussion around what we can do with this information. In particular, I’m intrigued by the possibilities of R alongside the data-linking habits that come with RDF. If anyone manages to tease anything interesting from this dataset, perhaps mixed in with DBpedia, do post your results.
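
As a starting point for that kind of linking, here is a rough Python sketch that turns a “Surname, Forename” string from the table into a candidate DBpedia resource URI. The function name dbpedia_candidate is my own invention, and the result is only a guess: nothing here checks that the page exists or that it describes the right person.

```python
from urllib.parse import quote


def dbpedia_candidate(name):
    """Turn 'Surname, Forename' into a guessed DBpedia resource URI.

    This is only a candidate: it is not checked against DBpedia, and
    titled or ambiguous names will still need matching by hand.
    """
    if "," in name:
        surname, forename = (part.strip() for part in name.split(",", 1))
        label = f"{forename} {surname}"
    else:
        label = name.strip()
    # DBpedia resource names use underscores where the page title has spaces.
    return "http://dbpedia.org/resource/" + quote(label.replace(" ", "_"))


for raw in ["Tonson, Jacob", "Stanyan, Abraham", "Voltaire"]:
    print(raw, "->", dbpedia_candidate(raw))
```

Entries that carry a title, such as “Montagu, 1st earl of Halifax, Charles”, will not survive this naive reordering, which is exactly where curated links would earn their keep.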
And of course there are always other datasets to examine; for example see the Darwin correspondence archives, or the Open Knowledge Foundation’s Open Correspondence project which has a Dickens-based pilot. While it is wonderful having UI that is tuned to the particulars of some dataset, it is also great when we can re-use UI code to explore similarly structured data from elsewhere. On both the data side and the UI side, this is expensive, tough work to do well. My current concern is to maximise re-use of both UI and data for the particular circumstances of second-screen TV navigation, a scenario rarely a first priority for anyone!
My hope is that custom navigation widgets for this sort of data will be natural components of next-generation TV remote controls, and that TV archives (and other collections) will open up enough of their metadata to draw in (possibly paying) viewers. To achieve this, we need the raw data on both sides to be as connectable as possible, so that application authors can spend their time thinking about what their users really need and can use, rather than on whether they’ve got the ‘right’ Henry Newton.
If we get it right, there’s a central role for librarianship and archivists in curating the public, linked datasets that tell us about the people, places and topics that will allow us to make new navigation trails through Web-connected television, literature and encyclopedia content. And we’ll also see new roles for custom visualizations, once we figure out an embedding framework for TV widgets that lets them communicate with a display system, with other users in the same room or community, and that is designed for cross-referencing datasets that talk about the same entities, topics, places etc.
As I mentioned regarding Lonclass and UDC, collaboration around open shared data often takes place in a furtive atmosphere of guilt and uncertainty. Is it OK to point to the underlying data behind a fantastic visualisation? How can we make sure the hard work that goes into that data curation is acknowledged and rewarded, even while its results flow more freely around the Web, and end up in places (your TV remote!) that may never have been anticipated?

Local Video for Local People

OK it’s all Google stuff, but still good to see. Go to Google Maps, My Maps, to find ‘Videos from YouTube’ listed. Here’s where I used to live (Bristol UK) and where I live now (Amsterdam, The Netherlands). Here’s a promo film of some nearby art installations from ArtZuid, who even have a page in English. I wouldn’t have found the video or the nearby links except through the map overlay. I wasn’t sure at first how the videos were being geotagged: I couldn’t see an option under ‘My Videos’ in YouTube, so I guessed it was automatic or came from viewer annotations. In fact you can add a map link in YouTube under ‘My Videos’ / ‘Edit Video’; I just didn’t see that initially. I made some investigations into similar issues (videos on maps) while at Joost; see the brief mention in my Fundamentos Web slides from a couple of years ago.
Oh, nearly forgot to mention: zooming out to get a Europe or World-wide view is quite striking too.

Family trees, Gedcom::FOAF in CPAN, and provenance

Ever wondered who the mother(s) of Adam and Eve’s grand-children were? Me too. But don’t expect SPARQL or the Semantic Web to answer that one! Meanwhile, …

You might nevertheless care to try the Gedcom::FOAF CPAN module from Brian Cassidy. It can read Gedcom, a popular ‘family history’ file format, and turn it into RDF (using FOAF and the relationship and biography vocabularies). A handy tool that can open up a lot of data to SPARQL querying.

The Gedcom::FOAF API seems to focus on turning the people or family Gedcom entries  into their own FOAF XML files. I wrote a quick and horrid Perl script that runs over a Gedcom file and emits a single flattened RDF/XML document. While URIs for non-existent XML files are generated, this isn’t a huge problem.

Perhaps someone would care to take a look at this code and see whether a more RDFa- and linked-data-friendly script would be useful?

Usage: perl gedcom2foafdump.pl BUELL001.GED > _sample_gedfoaf.rdf

The sample data I tested it on is intriguing, though I’ve not really looked around it yet.

It contains over 9800 people including the complete royal lines of England, France, Spain and the partial royal lines of almost all other European countries. It also includes 19 United States Presidents descended from royalty, including Washington, both Roosevelts, Bush, Jefferson, Nixon and others. It also has such famous people as Brigham Young, William Bradford, Napoleon Bonaparte, Winston Churchill, Anne Bradstreet (Dudley), Jesus Christ, Daniel Boone, King Arthur, Jefferson Davis, Brian Boru King of Ireland, and others. It goes all the way back to Adam and Eve and also includes lines to ancient Rome including Constantine the Great and ancient Egypt including King Tutankhamen (Tut).

The data is credited to Matt & Ellie Buell, “Uploaded By: Eochaid”, 1995-05-25.

Here’s an extract to give an idea of the Gedcom form:

0 @I4961@ INDI
1 NAME Adam //
1 SEX M
1 REFN +
1 BIRT
2 DATE ABT 4000 BC
2 PLAC Eden
1 DEAT
2 DATE ABT 3070 BC
1 FAMS @F2398@
1 NOTE He was the first human on Earth.
1 SOUR Genesis 2:20 KJV
0 @I4962@ INDI
1 NAME Eve //
1 SEX F
1 REFN +
1 BIRT
2 DATE ABT 4000 BC
2 PLAC Eden
1 FAMS @F2398@
1 SOUR Genesis 3:20 KJV

It might not directly answer the great questions of biblical scholarship, but it could be a fun dataset to explore Gedcom / RDF mappings with. I wonder how it compares with Freebase, DBpedia etc.
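
To get a feel for the format before worrying about RDF mappings, here is a small Python sketch that groups lines like those in the extract into one dictionary per individual. It is not how Gedcom::FOAF works internally, just a naive reading of the level / tag / value layout visible above. The function name and the dotted key convention (BIRT.DATE) are my own.

```python
GEDCOM = """\
0 @I4961@ INDI
1 NAME Adam //
1 SEX M
1 BIRT
2 DATE ABT 4000 BC
2 PLAC Eden
1 FAMS @F2398@
1 SOUR Genesis 2:20 KJV
0 @I4962@ INDI
1 NAME Eve //
1 SEX F
1 FAMS @F2398@
1 SOUR Genesis 3:20 KJV
"""


def parse_individuals(text):
    """Group Gedcom lines into one dict per INDI record.

    Nested tags are flattened into keys such as 'BIRT.DATE'; repeated tags
    simply overwrite each other, which is good enough for a quick look.
    """
    people = {}
    current = None
    path = []  # tag names of the enclosing levels
    for line in text.splitlines():
        parts = line.split(" ", 2)
        if len(parts) < 2:
            continue
        level = int(parts[0])
        if level == 0:
            # Level-0 lines look like '0 @I4961@ INDI': an id, then a record type.
            is_indi = len(parts) == 3 and parts[2] == "INDI"
            current = people.setdefault(parts[1], {}) if is_indi else None
            path = []
            continue
        if current is None:
            continue
        tag = parts[1]
        value = parts[2] if len(parts) == 3 else ""
        path = path[: level - 1] + [tag]
        if value:
            current[".".join(path)] = value
    return people


people = parse_individuals(GEDCOM)
# Individuals that share a FAMS id are spouses in the same family record.
print(people["@I4961@"]["NAME"], "/", people["@I4962@"]["NAME"])
print(people["@I4961@"]["BIRT.DATE"])
```

A real file also has continuation lines and repeated tags, which this sketch ignores.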

The Perl module is a good start for experimentation but it only really scratches the surface of the problem of representing source/provenance and uncertainty. On which topic, Jeni Tennison has a post from a year ago that’s well worth (re-)reading.

What I’ve done in the above little Perl script is implement a simplification: instead of each family description being its own separate XML file, they are all squashed into a big flat set of triples (‘graph’). This may or may not be appropriate, depending on the sourcing of the records. It seems Gedcom offers some basic notion of ‘source’, although not one expressed in terms of URIs. If I look in the SOUR(ce) field in the Gedcom file, I see information like this (which currently seems to be ignored in the Gedcom::FOAF mapping):

grep SOUR BUELL001.GED | sort | uniq

1 NOTE !SOURCE:Burford Genealogy, Page 102 Cause of Death; Hemorrage of brain
1 NOTE !SOURCE:Gertrude Miller letter “Harvey Lee lived almost 1 year. He weighed
1 NOTE !SOURCE:Gertrude Miller letter “Lynn died of a ruptured appendix.”
1 NOTE !SOURCE:Gertrude Miller letter “Vivian died of a tubal pregnancy.”
1 SOUR “Castles” Game Manuel by Interplay Productions
1 SOUR “Mayflower Descendants and Their Marriages” pub in 1922 by Bureau of
1 SOUR “Prominent Families of North Jutland” Pub. in Logstor, Denmark. About 1950
1 SOUR /*- TUT
1 SOUR 273
1 SOUR AHamlin777.  E-Mail “Descendents of some guy
1 SOUR Blundell, Sherrie Lea (Slingerland).  information provided on 16 Apr 1995
1 SOUR Blundell, William, Rev. Interview on Jan 29, 1995.
1 SOUR Bogert, Theodore. AOL user “TedLBJ” File uploaded to American Online
1 SOUR Buell, Barbara Jo (Slingerland)
1 SOUR Buell, Beverly Anne (Wenge)
1 SOUR Buell, Beverly Anne (Wenge).  letter addressed to Kim & Barb Buell dated
1 SOUR Buell, Kimberly James.
1 SOUR Buell, Matthew James. written December 19, 1994.
1 SOUR Burnham, Crystal (Harris).  Leter sent to Matt J. Buell on Mar 18, 1995.
1 SOUR Burnham, Crystal Colleen (Harris).  AOL user CBURN1127.  E-mail “Re: [...etc.]

Some of these sources could be tied to cleaner IDs (eg. for books c/o Open Library, although see ‘in search of cultural identifiers’ from Michael Smethurst).
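
The grep above loses track of which record each source was attached to. As a rough Python sketch (again my own naive reading of the format, with an invented function name), the same pass can keep that link, which is the first thing you would need before tying sources to cleaner identifiers:

```python
from collections import defaultdict

GEDCOM = """\
0 @I4961@ INDI
1 NAME Adam //
1 SOUR Genesis 2:20 KJV
0 @I4962@ INDI
1 NAME Eve //
1 SOUR Genesis 3:20 KJV
"""


def sources_by_record(text):
    """Map each SOUR value to the ids of the records that cite it."""
    cited_by = defaultdict(set)
    record_id = None
    for line in text.splitlines():
        parts = line.split(" ", 2)
        if len(parts) < 2:
            continue
        if parts[0] == "0":
            # Level-0 lines start a new record; keep its id (if it has one).
            record_id = parts[1] if parts[1].startswith("@") else None
        elif parts[1] == "SOUR" and len(parts) == 3 and record_id:
            cited_by[parts[2].strip()].add(record_id)
    return cited_by


for source, records in sorted(sources_by_record(GEDCOM).items()):
    print(source, "<-", ", ".join(sorted(records)))
```

Unlike the grep, this only looks at SOUR tags, so the ‘NOTE !SOURCE:’ lines would need a second rule.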

I believe RDF’s SPARQL language gives us a useful tool (the notion of ‘GRAPH’) that can be applied here, but we’re a long way from having worked out the details when it comes to attaching evidence to claims. So for now, we in the RDF scene have a fairly coarse-grained approach to data provenance. Databases are organized into batches of triples, ie. RDF statements that claim something about the world. And while we can use these batches – aka graphs – in our queries, we haven’t really figured out what kind of information we want to associate with them yet. Which is a pity, since this could have uses well beyond family history, for example to online journalistic practices and blog-mediated fact checking.
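
To make the ‘batches of triples’ idea concrete, here is a toy Python sketch that stores each statement with the name of the graph it came from, and asks which graphs assert a given triple. It uses plain tuples rather than a real RDF store. Every identifier and graph name in it is made up for illustration, and the per-graph properties are only a guess at what one might want to record.

```python
# Each statement is (subject, predicate, object, graph). All URIs and graph
# names below are made up for illustration.
QUADS = [
    ("ex:I4961", "foaf:name", "Adam", "graph:buell-gedcom"),
    ("ex:I4961", "ex:birthPlace", "Eden", "graph:buell-gedcom"),
    ("ex:I4961", "foaf:name", "Adam", "graph:another-source"),
]

# What we know about each graph; deciding which properties belong here is
# exactly the open question discussed above.
GRAPH_INFO = {
    "graph:buell-gedcom": {"source": "BUELL001.GED", "date": "1995-05-25"},
    "graph:another-source": {"source": "unknown"},
}


def who_says(subject, predicate, obj):
    """Return the graphs asserting a triple, in the spirit of SPARQL's GRAPH ?g { ... }."""
    return sorted(g for s, p, o, g in QUADS if (s, p, o) == (subject, predicate, obj))


for g in who_says("ex:I4961", "foaf:name", "Adam"):
    print(g, GRAPH_INFO[g])
```

The point of the sketch is only that the graph name gives us somewhere to hang evidence; it says nothing about how that evidence should be modelled.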

Nearby in the Web: see also the SIOC/SWAN telecons, a collaboration in the W3C SemWeb lifescience community around the topic of modelling scientific discourse.

Apparently the UK government are revisiting the idea of net censorship, in the context of anti-terrorism.

UK Home Secretary Jacqui Smith, as reported in the Guardian, “Government targets extremist websites”:

Speaking to the BBC’s Radio 4 Today programme before her speech, Smith said there were specific examples of websites that “clearly fall under the category of gratifying terrorism”. “There is growing evidence people may be using the internet both to spread messages and to plan specifically for terrorism,” she said. “That is why, as well as changing the law to make sure we can tackle that, there is more we need to do to show the internet is not a no-go area as far as tackling terrorism is concerned.”

This could go really wrong, really fast. Will we be allowed to read Bin Laden texts online? Hitler, Stalin? Talk to people who sympathise with organizations deemed terroristic? Who live in countries in the ‘axis of evil’? Doubtless the first sites to be targeted will be the most outrageous, but we’re on a slippery slope here.

It’s pretty much impossible to stop the online radicalisation of angry young men. But driving that process underground, and criminalising anyone on the fringes of the scene, will make it all the harder for calm voices and nuanced opinions to be heard. ‘Us and them’ is exactly what we don’t need right now.

British Board of Film Classification RSS feeds and Movie metadata

The BBFC have several RSS feeds on their site, carrying information about their judgements on various cinematic works for a UK audience. Recent film decisions, recent adult (sex) videos and films, etc. Each entry in the feed points to a descriptive page and summarises a BBFC judgement in a simple textual description, eg. “The BBFC gave the English language video LES PERVERSIONS 5 a rating of R18 on Thu, 10 Feb. Consumer advice is not supplied for R18 titles. the video is directed by Sineplex.”

While their adult feed is interesting in the context of the debates around Web filtering etc., the mainstream feed is also interesting. It has textual information about sex, violence, drugs etc., which could easily be exposed in machine-processable form if they’d used RSS 1.0 + ICRA/RDF labels. Both make the semantic web point about data-reuse – since they can be used for finding things as much as for not finding things.

The BBFC gave the English language film TABLOID a rating of 18 on Fri, 28 Jan. This film contains STRONG SEX, VIOLENCE, LANGUAGE AND DRUG USE. The film is directed by David Blair. The cast includes Matthew Rhys, Mary Elizabeth Mastrantonio, David Soul, John Hurt, Stephen Tompkinson, Art Malik, Dani Behr, Keith Chegwin, Ainsley Harriott, Gail Porter, Beverley Callard, Les Dennis, Danny Dyer, James Hewitt, Freddie Jones, Vicky Holloway, Vikki Thomas and Anna Kumble.
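
Until such labels exist, the regular wording of these descriptions makes them fairly easy to pick apart. Here is a rough Python sketch that works on the two examples quoted in this post; the patterns are guessed from those two sentences only, so other entries in the feed may well not match, and the field names are my own.

```python
import re

ENTRY = (
    "The BBFC gave the English language film TABLOID a rating of 18 on "
    "Fri, 28 Jan. This film contains STRONG SEX, VIOLENCE, LANGUAGE AND "
    "DRUG USE. The film is directed by David Blair."
)

# Patterns guessed from the sample descriptions, not from any BBFC specification.
HEAD = re.compile(
    r"gave the (?P<language>.+?) language (?P<kind>film|video) (?P<title>.+?) "
    r"a rating of (?P<rating>\S+) on (?P<date>[^.]+)\."
)
ADVICE = re.compile(r"This (?:film|video) contains (?P<advice>[^.]+)\.")
DIRECTOR = re.compile(r"[Tt]he (?:film|video) is directed by (?P<director>[^.]+)\.")


def parse_entry(text):
    """Pull the labelled fields out of one feed description, or return None."""
    head = HEAD.search(text)
    if head is None:
        return None
    fields = head.groupdict()
    advice = ADVICE.search(text)
    # 'STRONG SEX, VIOLENCE, LANGUAGE AND DRUG USE' -> list of terms
    fields["advice"] = (
        [t.strip() for t in re.split(r",| AND ", advice.group("advice")) if t.strip()]
        if advice
        else []
    )
    director = DIRECTOR.search(text)
    fields["director"] = director.group("director") if director else None
    return fields


print(parse_entry(ENTRY))
```

Note that splitting the advice on commas treats ‘STRONG’ as belonging to the first term only, which may not be what the BBFC intends, and a director’s name containing a full stop would be cut short.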

I’ve been thinking about how FOAF could better support recommendation systems, eg. around MusicBrainz for music, or systems like MindSwap’s FilmTrust for movies. For movies, one core issue is quite simple: providing unique identifiers for films (direct or indirect, eg. via a page that has some film as its primary topic). BBFC or IMDB pages, or movie homepages, could serve such a purpose. Unfortunately, the world of movies doesn’t yet have a good open-content licensed database, unlike music, where we have MusicBrainz. Until we agree on some tricks for identifying things like movies (and actors, …), we won’t get the data integration needed to have a really rich Web-wide movie review system.

We will eventually, I am sure, see a framework in which various sites aggregate and syndicate such opinions, either numerical ratings or (more likely I think) textual reviews. Often I’m quite interested to see how a movie was perceived by people I disagree with, or have never met. The CapAlert site is often entertaining, for example. All these sources (as well as smaller community datasets) will be mixed together in a metadata marketplace. Information that some people use for filtering, blocking and avoiding will be used by others for searching, browsing and discovery. It’s just a matter of time before we’ll be using W3C’s new SPARQL technology to query BBFC judgement feeds, FOAF+review data from sites like FilmTrust and other weblog-based data sources… Anyhow, definitely check out the FilmTrust site if you’re interested in movie metadata and ratings.