and One Hundred Years of Search

A talk from the London SemWeb meetup hosted by the BBC Academy in London, Mar 30 2012.

Slides and video are already in the Web, but I wanted to post this as an excuse to plug the new Web History Community Group that Max and I have just started at W3C. The talk was part of the Libraries, Media and the Semantic Web meetup hosted by the BBC in March. It gave an opportunity to run through some forgotten history, linking Paul Otlet, the Universal Decimal Classification, and some 100-year-old search logs from Otlet’s Mundaneum. Having worked with the BBC Lonclass system (a descendant of Otlet’s UDC), and collaborated with Aida Slavic of the UDC on their publication of Linked Data, I was happy to be given the chance to try to spell out these hidden connections. It also turned out that Google colleagues have been working to support the Mundaneum and the memory of this early work, and I’m happy that the talk led to discussions with both the Mundaneum and the Computer History Museum about the new Web History group at W3C.

So, everything’s connected. Many thanks to W. Boyd Rayward (Otlet’s biographer) for sharing the ancient logs that inspired the talk (see slides/video for a few more details). I hope we can find more such things to share in the Web History group, because the history of the Web didn’t begin with the Web…

Linked Literature, Linked TV – Everything Looks like a Graph


Ben Fry in ‘Visualizing Data‘:

Graphs can be a powerful way to represent relationships between data, but they are also a very abstract concept, which means that they run the danger of meaning something only to the creator of the graph. Often, simply showing the structure of the data says very little about what it actually means, even though it’s a perfectly accurate means of representing the data. Everything looks like a graph, but almost nothing should ever be drawn as one.

There is a tendency when using graphs to become smitten with one’s own data. Even though a graph of a few hundred nodes quickly becomes unreadable, it is often satisfying for the creator because the resulting figure is elegant and complex and may be subjectively beautiful, and the notion that the creator’s data is “complex” fits just fine with the creator’s own interpretation of it. Graphs have a tendency of making a data set look sophisticated and important, without having solved the problem of enlightening the viewer.


Ben Fry is entirely correct.

I suggest two excuses for this indulgence: if the visuals are meaningful only to the creator of the graph, then let’s make everyone a graph curator. And if the things the data attempts to describe — for example, 14 million books and the world they in turn describe — are complex and beautiful and under-appreciated in their complexity and interconnectedness, … then perhaps it is ok to indulge ourselves. When do graphs become maps?

I report here on some experiments that stem from two collaborations around Linked Data. All the visuals in the post are views of bibliographic data, based on similarity measures derived from book / subject keyword associations, with visualization and a little additional analysis using Gephi. Click-through to Flickr to see larger versions of any image. You can’t always see the inter-node links, but the presentation is based on graph layout tools.

Firstly, in my ongoing work in the NoTube project, we have been working with TV-related data, ranging from ‘social Web’ activity streams, user profiles, TV archive catalogues and classification systems like Lonclass. Secondly, over the summer I have been working with the Library Innovation Lab at Harvard, looking at ways of opening up bibliographic catalogues to the Web as Linked Data, and at ways of cross-linking Web materials (e.g. video materials) to a Webbified notion of ‘bookshelf‘.

In NoTube we have been making use of the Apache Mahout toolkit, which provided us with software for collaborative filtering recommendations, clustering and automatic classification. We’ve barely scratched the surface of what it can do, but here I show some initial results applying Mahout to a 100,000 record subset of Harvard’s 14 million entry catalogue. Mahout is built to scale, and the experiments here use datasets that are tiny from Mahout’s perspective.


In NoTube, we used Mahout to compute similarity measures between each pair of items in a catalogue of BBC TV programmes for which we had privileged access to subjective viewer ratings. This was a sparse matrix of around 20,000 viewers, 12,500 broadcast items, with around 1.2 million ratings linking viewer to item. From these, after a few rather-too-casual tests using Mahout’s evaluation measure system, we picked its most promising similarity measure for our data (LogLikelihoodSimilarity or Tanimoto), and then for the most similar items, simply dumped out a huge data file that contained pairs of item numbers, plus a weight.
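To make the item-to-item step concrete, here is a minimal sketch of a Tanimoto similarity computed over the sets of viewers who rated each item, dumped as pairs of item numbers plus a weight. It is not the Mahout code we ran: the viewer and item identifiers are invented, and the sketch looks only at who rated what, ignoring the rating values themselves.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def item_pairs(ratings):
    """Turn (viewer, item) pairs into (item1, item2, weight) rows.

    Only pairs with a non-zero similarity are kept, which mirrors the
    idea of dumping a file of item pairs plus a weight.
    """
    viewers_by_item = {}
    for viewer, item in ratings:
        viewers_by_item.setdefault(item, set()).add(viewer)
    rows = []
    for i, j in combinations(sorted(viewers_by_item), 2):
        weight = tanimoto(viewers_by_item[i], viewers_by_item[j])
        if weight > 0:
            rows.append((i, j, weight))
    return rows


# Hypothetical toy data, not taken from the BBC ratings set.
ratings = [
    ("alice", "item_1"),
    ("bob", "item_1"),
    ("alice", "item_2"),
    ("carol", "item_3"),
]
pairs = item_pairs(ratings)
for i, j, weight in pairs:
    print(f"{i}\t{j}\t{weight:.3f}")
```

Comparing every pair of items like this is quadratic in the number of items, which is fine for a toy but is exactly the kind of work a toolkit built to scale is meant to take off your hands.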

There are many, many smarter things we could’ve tried, but in the spirit of ‘minimal viable product’, we didn’t try them yet. These include making use of additional metadata published by the BBC in RDF, so we can help out Mahout by letting it know that when Alice loves item_62 and Bob loves item_82127, we also know via RDF that they are both in the same TV series and Brand. Why use fancy machine learning to rediscover things we already know, and that have been shared in the Web as data? We could make smarter use of metadata here. Secondly, we could have used data-derived or publisher-supplied metadata to explore whether different Mahout techniques work better for different segments of the content (factual vs fiction) or even, as we also have some demographic data, different groups of users.


Anyway, Mahout gave us item-to-item similarity measures for TV. Libby has written already about how we used these in ‘second screen’ (or ‘N-th’ screen, aka N-Screen) prototypes showing the impact that new Web standards might make on tired and outdated notions of “TV remote control”.

What if your remote control could personalise a view of some content collection? What if it could show you similar things based on your viewing behavior, and that of others? What if you could explore the ever-growing space of TV content using simple drag-and-drop metaphors, sending items to your TV or to your friends with simple tablet-based interfaces?


So that’s what we’ve been up to in NoTube. There are prototypes using BBC content (sadly not viewable by everyone due to rights restrictions), but also some experiments with TV materials from the Internet Archive, and some explorations that look at TED’s video collection as an example of Web-based content that (via the TED site and YouTube) is more generally viewable. Since every item in the BBC’s Archive is catalogued using a library-based classification system (Lonclass, itself based on UDC) the topic of cross-referencing books and TV has cropped up a few times.


Meanwhile, in (the digital Public Library of) America, … the Harvard Library Innovation Lab team have a huge and fantastic dataset describing 14 million bibliographic records. I’m not sure exactly how many are ‘books’; libraries hold all kinds of objects these days. With the Harvard folk I’ve been trying to help figure out how we could cross-reference their records with other “Webby” sources, such as online video materials. Again we are using TED as an example, because it is high quality but has very different metadata from the library records. We’ve been looking at various tricks and techniques that could help us associate book records with TED videos. For example, we can find tags for their videos on the TED site, but also on delicious, and on YouTube. However, taggers and librarians tend to describe things quite differently. Tags like “todo”, “inspirational”, “design”, “development” or “science” don’t help us pin-point the exact library shelf where a viewer might go to read more on the topic. Or conversely, they don’t help the library sites understand where within their online catalogues they could embed useful and engaging “related link” pointers off to the TED site or YouTube.

So we turned to other sources. Matching TED speaker names against Wikipedia allows us to find more information about many TED speakers. For example the Tim Berners-Lee entry, which in its Linked Data form helpfully tells us that this TED speaker is in the categories ‘Japan_Prize_laureates’, ‘English_inventors’, ‘1955_births’, ‘Internet_pioneers’. All good to know, but it’s hard to tell which categories tell us most about our speaker or video. At least now we’re in the Linked Data space, we can navigate around to Freebase, VIAF and a growing Web of data-sources. It should be possible at least to associate TimBL’s TED talks with library records for his book (so we annotate one bibliographic entry, from 14 million! …can’t we map areas, not items?).


Can we do better? What if we also associated Tim’s two TED talk videos with other things in the library that had the same subject classifications or keywords as his book? What if we could build links between the two collections based not only on published authorship, but on topical information (tags, full text analysis of TED talk transcripts)? Can we plan for a world where libraries have access not only to MARC records, but also full text of each of millions of books?


I’ve been exploring some of these ideas with David Weinberger, Paul Deschner and Matt Phillips at Harvard, and in NoTube with Libby Miller, Vicky Buser and others.


Yesterday I took the time to make a visual sanity check of the bibliographic data as processed into a ‘similarity space’ in some Mahout experiments. This is a messy first pass at everything, but I figured it is better to blog something and look for collaborations and feedback, than to chase perfection. For me, the big story is in linking TV materials to the gigantic back-story of context, discussion and debate curated by the world’s libraries. If we can imagine a view of our TV content catalogues, and our libraries, as visual maps, with items clustered by similarity, then NoTube has shown that we can build these into the smartphones and tablets that are increasingly being used as TV remote controls.


And if the device you’re using to pause/play/stop or rewind your TV also has access to these vast archives as they open up as Linked Data (as well as GPS location data and your Facebook password), all kinds of possibilities arise for linked, annotated and fact-checked TV, as well as for showing a path for libraries to continue to serve as maps of the entertainment, intellectual and scientific terrain around us.


A brief technical description. Everything you see here was made with Gephi, Mahout and experimental data from the Library Innovation Lab at Harvard, plus a few scripts to glue it all together.

Mahout was given 100,000 extracts from the Harvard collection. Just main and sub-title, a local ID, and a list of topical phrases (mostly drawn from Library of Congress Subject Headings, with some local extensions). I don’t do anything clever with these or their sub-structure or their library-documented inter-relationships. They are treated as atomic codes, and flattened into long pseudo-words such as ‘occupational_diseases_prevention_control’ or ‘french_literature_16th_century_history_and_criticism’, ‘motion_pictures_political_aspects’, ‘songs_high_voice_with_lute’, ‘dance_music_czechoslovakia’, ‘communism_and_culture_soviet_union’. All of human life is there.
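The flattening step is simple enough to sketch. The version below assumes the headings arrive as display strings with “ -- ” between subdivisions; that input shape is my guess for illustration, not a description of the Harvard extract, and the function name is made up.

```python
import re


def flatten_heading(phrase):
    """Lower-case a subject heading and join its word runs with underscores.

    Punctuation and subdivision separators are simply dropped, so the
    internal structure of the heading is lost; that is the point of
    treating it as an atomic code. Only ASCII letters and digits survive,
    which is a simplification.
    """
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    return "_".join(words)


# Hypothetical input strings chosen to match pseudo-words quoted in the post.
examples = [
    "Motion pictures -- Political aspects",
    "French literature -- 16th century -- History and criticism",
]
codes = [flatten_heading(e) for e in examples]
print(codes)
```

The attraction of this trick is that each heading then looks like a single token to any text-processing tool that splits on whitespace.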

David Weinberger has been calling this gigantic scope our problem of the ‘Taxonomy of Everything’, and the label fits. By mushing phrases into fake words, I get to re-use some Mahout tools and avoid writing code. The result is a matrix of 100,000 bibliographic entities, by 27684 unique topical codes. Initially I made the simple test of feeding this as input to Mahout’s K-Means clustering implementation. Manually inspecting the most popular topical codes for each cluster (both where k=12 to put all books in 12 clusters, or k=1000 for more fine-grained groupings), I was impressed by the initial results.
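The manual inspection of each cluster boils down to counting topical codes per cluster. A rough sketch, assuming the cluster assignments have already been exported as a book-to-cluster mapping (the identifiers and data layout below are invented for illustration and are not Mahout’s output format):

```python
from collections import Counter, defaultdict


def top_codes_per_cluster(assignments, codes_by_book, n=3):
    """Return the n most frequent topical codes for each cluster.

    assignments:   dict mapping book id -> cluster id
    codes_by_book: dict mapping book id -> list of topical codes
    """
    counts = defaultdict(Counter)
    for book, cluster in assignments.items():
        counts[cluster].update(codes_by_book.get(book, []))
    return {
        cluster: [code for code, _ in counter.most_common(n)]
        for cluster, counter in counts.items()
    }


# Hypothetical toy data; real runs used 100,000 books and k=12 or k=1000.
assignments = {"b1": 0, "b2": 0, "b3": 1}
codes_by_book = {
    "b1": ["museums_history", "archives"],
    "b2": ["museums_history"],
    "b3": ["brain_physiology"],
}
summary = top_codes_per_cluster(assignments, codes_by_book, n=1)
print(summary)
```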


I only have these in crude text-file form. See hv/_k1000.txt and hv/_twelve.txt (plus dictionary, see big file _harv_dict.txt).

For example, in the 1000-cluster version, we get: ‘medical_policy_united_states’, ‘health_care_reform_united_states’, ‘health_policy_united_states’, ‘medical_care_united_states’, ‘delivery_of_health_care_united_states’, ‘medical_economics_united_states’, ‘politics_united_states’, ‘health_services_accessibility_united_states’, ‘insurance_health_united_states’, ‘economics_medical_united_states’.

Or another cluster: ‘brain_physiology’, ‘biological_rhythms’, ‘oscillations’.

How about: ‘museums_collection_management’, ‘museums_history’, ‘archives’, ‘museums_acquisitions’, ‘collectors_and_collecting_history’?

Another, conceptually nearby (but that proximity isn’t visible through this simple clustering approach), ‘art_thefts’, ‘theft_from_museums’, ‘archaeological_thefts’, ‘art_museums’, ‘cultural_property_protection_law_and_legislation’, …

Ok, I am cherry picking. There is some nonsense in there too, but surprisingly little. And probably some associations that might cause offense. But it shows that the tooling is capable (by looking at book/topic associations) of picking out similarities that are significant. Maybe all of this is also available in LCSH SKOS form already, but I doubt it. (A side-goal here is to publish these clusters for re-use elsewhere…).


So, what if we take this, and instead compute (a bit like we did in NoTube from ratings data) similarity measures between books?


I tried that, without using much of Mahout’s sophistication. I used its ‘rowsimilarityjob’ facility and generated similarity measures for each book, then threw out most of the similarities except the top 5, later the top 3, from each book. From this point, I moved things over into the Gephi toolkit (“photoshop for graphs”), as I wanted to see how things looked.
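The pruning and hand-over to Gephi can be sketched as follows. This is a stand-in for the glue scripts rather than the scripts themselves: the similarity data structure is invented, and the “Source, Target, Weight” column names are what I understand Gephi’s spreadsheet import to expect for an edge list, so treat that as an assumption to check against your Gephi version.

```python
import csv
import heapq
import io


def top_k_edges(similarities, k=3):
    """Keep only the k strongest similarities for each book.

    similarities: dict mapping book -> {other_book: weight}
    Returns a flat list of (book, other_book, weight) edges.
    """
    edges = []
    for book, neighbours in similarities.items():
        best = heapq.nlargest(k, neighbours.items(), key=lambda kv: kv[1])
        edges.extend((book, other, weight) for other, weight in best)
    return edges


def write_edge_csv(edges, out):
    """Write edges as CSV with a header row, one edge per line."""
    writer = csv.writer(out)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)


# Hypothetical similarity scores between made-up book ids.
sims = {
    "book_a": {"book_b": 0.9, "book_c": 0.2, "book_d": 0.5},
    "book_b": {"book_a": 0.9},
}
edges = top_k_edges(sims, k=2)
buf = io.StringIO()
write_edge_csv(edges, buf)
print(buf.getvalue())
```

Dropping from the top 5 to the top 3 per book is just a change of `k`; fewer edges per node makes the layout less of a hairball at the cost of hiding weaker but possibly interesting links.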


First results shown here. Nodes are books, links are strong similarity measures. Node labels are titles, or sometimes title + subtitle. Some (the black-background ones) use Gephi’s “modularity detection” analysis of the link graph. For others (white background) I imported the 1000 clusters from the earlier Mahout experiments. I tried various of the metrics in Gephi and mapped these to node size. This might fairly be called ‘playing around’ at this stage, but there is at least a pipeline from raw data (eventually Linked Data I hope) through Mahout to Gephi and some visual maps of literature.


What does all this show?

That if we can find a way to open up bibliographic datasets, there are solid opensource tools out there that can give new ways of exploring the items described in the data. That those tools (e.g. Mahout, Gephi) provide many different ways of computing similarity, clustering, and presenting. There is no single ‘right answer’ for how to present literature or TV archive content as a visual map, clustering “like with like”, or arranging neighbourhoods. And there is also no restriction that we must work dataset-by-dataset, either. Why not use what we know from movie/TV recommendations to arrange the similarity space for books? Or vice-versa?

I must emphasise (to return to Ben Fry’s opening remark) that this is a proof-of-concept. It shows some potential, but it is neither a user interface, nor particularly informative. Gephi as a tool for making such visualizations is powerful, but it too is not a viable interface for navigating TV content. However these tools do give us a glimpse of what is hidden in giant and dull-sounding databases, and some hints for how patterns extracted from these collections could help guide us through literature, TV or more.

Next steps? There are many things that could be tried; more than I could attempt. I’d like to get some variant of these 2D maps onto ipad/android tablets, loaded with TV content. I’d like to continue exploring the bridges between content (eg. TED) and library materials, on tablets and PCs. I’d like to look at Mahout’s “collocated terms” extraction tools in more detail. These allow us to pull out recurring phrases (e.g. “Zero Sum”, “climate change”, “golden rule”, “high school”, “black holes” were found in TED transcripts). I’ve also tried extracting bi-gram phrases from book titles using the same utility. Such tools offer some prospect of bulk-creating links not just between single items in collections, but between neighbourhood regions in maps such as those shown here. The cross-links will never be perfect, but then what’s a little serendipity between friends?
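For a feel of what bi-gram extraction from titles involves, here is a deliberately naive sketch that just counts adjacent word pairs. Mahout’s collocation tooling ranks candidates statistically rather than by raw frequency, so this is a simplification, and the titles are invented.

```python
import re
from collections import Counter


def bigram_counts(texts):
    """Count adjacent word pairs across a collection of short texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(zip(words, words[1:]))
    return counts


# Hypothetical titles, not drawn from the Harvard data.
titles = [
    "Climate change and the sea",
    "The politics of climate change",
]
counts = bigram_counts(titles)
print(counts.most_common(1))
```

Raw counts will happily surface pairs like “of the”, which is why a statistical score or a stop-word filter is needed before such phrases are any use as cross-links.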

As full text access to book data looms, and TV archives are finding their way online, we’ll need to find ways of combining user interface, bibliographic and data science skills if we’re really going to make the most of the treasures that are being shared in the Web. Since I’ve only fragments of each, I’m always drawn back to think of this in terms of collaborative work.

A few years ago, Netflix had the vision and cash to pretty much buy the attention of the entire machine learning community for a measly million dollars. Researchers love to have substantive datasets to work with, and the (now retracted) Netflix dataset is still widely sought after. Without a budget to match Netflix’, could we still somehow offer prizes to help get such attention directed towards analysis and exploitation of linked TV and library data? We could offer free access to the world’s literature via a global network of libraries? Except everyone gets that for free already. Maybe we don’t need prizes.

Nearby in the Web: NoTube N-Screen, Flickr slideshow

Lonclass and RDF

Lonclass is one of the BBC’s in-house classification systems – the “London classification”. I’ve had the privilege of investigating lonclass within the NoTube project. It’s not currently public, but much of what I say here is also applicable to the Universal Decimal Classification (UDC) system upon which it was based. UDC is also not fully public yet; I’ve made a case elsewhere that it should be, and I hope we’ll see that within my lifetime. UDC and Lonclass have a fascinating history and are rich cultural heritage artifacts in their own right, but I’m concerned here only with their role as the keys to many of our digital and real-world archives.

Why would we want to map Lonclass or UDC subject classification codes into RDF?

Lonclass codes can be thought of as compact but potentially complex sentences, built from the thousands of base ‘words’ in the Lonclass dictionary. By mapping the basic pieces, the words, to other data sources, we also enrich the compound sentences. We can’t map all of the sentences as there can be infinitely many of them – it would be an expensive and never-ending task.

For example, we might have a lonclass code for “Report on the environmental impact of the decline of tin mining in Sweden in the 20th century”. This would be a jumble of numbers and punctuation which I won’t trouble you with here. But if we parsed out that structure we can see the complex code as built from primitives such as ‘tin mining’ (itself e.g. ‘Tin’ and ‘Mining’), ‘Sweden’, etc. By linking those identifiable parts to shared Web data, we also learn more about the complex composite codes that use them. Wikipedia’s Sweden entry tells us in English, “Sweden has land borders with Norway to the west and Finland to the northeast, and water borders with Denmark, Germany, and Poland to the south, and Estonia, Latvia, Lithuania, and Russia to the east.” Increasingly this additional information is available in machine-friendly form. Right now we can’t learn about Sweden’s borders from the bits of Wikipedia reflected into DBpedia’s Sweden entry, but UN FAO’s geopolitical ontology does have this information and more in RDF form.

There is more, much more, to know about Sweden than can possibly be represented directly within Lonclass or UDC. Yet those facts may also be very useful for the retrieval of information tagged with Sweden-related Lonclass codes. If we map the Lonclass notion of ‘Sweden’ to identified concepts described elsewhere, then whenever we learn more about the latter, we also learn more about the former, and indirectly, about anything tagged with complex lonclass codes using that concept. Suddenly an archived TV documentary tagged as covering a ‘report on the environmental impact of the decline of tin mining in Sweden’ is accessible also to people or machines looking under Scandinavia + metal mining. Environmental matters, after all, often don’t respect geo-political borders; someone searching for coverage of environmental trends in a neighbouring country might well be happy to find this documentary. But should Lonclass or UDC maintain an index of which countries border which others? Surely not!
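The retrieval argument can be shown with a toy lookup. Everything here is hypothetical: the identifiers, the mapping table and the tiny “borders” dataset stand in for a real mapping from classification primitives to external Linked Data, with only the land borders from the Wikipedia sentence quoted earlier included.

```python
# Hypothetical mapping from classification primitives to external identifiers.
CONCEPT_LINKS = {"lonclass:sweden": "ext:Sweden"}

# Hypothetical external facts, maintained by someone other than the classifier.
BORDERS = {"ext:Sweden": {"ext:Norway", "ext:Finland"}}

# Hypothetical archive items and the primitive concepts found in their codes.
TAGGED = {"doc_tin_mining": {"lonclass:sweden", "lonclass:tin_mining"}}


def docs_near(country):
    """Find items tagged with a concept whose linked country borders `country`."""
    hits = set()
    for doc, tags in TAGGED.items():
        for tag in tags:
            external = CONCEPT_LINKS.get(tag)
            if external and country in BORDERS.get(external, set()):
                hits.add(doc)
    return hits


print(docs_near("ext:Norway"))
```

The classification scheme itself never stores which countries border which; the border facts live in the external dataset and are only reached through the mapping.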

Lonclass and UDC codes have a rich hidden structure that is rarely exploited with modern tools. Lonclass, by virtue of its UDC heritage, does a lot of work itself towards representing complex conceptual inter-relationships. It embodies a conceptual map of our world, with mysterious codes (well known in the library world) for topics such as ‘622 – mining’, but also specifics e.g. ‘622.3 Mining of specific minerals, ores, rocks’, and combinations (‘622.3:553.9 Extraction of carbonaceous minerals, hydrocarbons’). By joining a code for ‘mining a specific mineral…’ to a code for ‘553.9 Deposits of carbonaceous rocks. Hydrocarbon deposits’ we get a compound term. So Lonclass/UDC “knows” about the relationship between “Tin Mining” and “Mining”, “metals” etc., and quite likely between “Sweden” and “Scandinavia”. But it can’t know everything! Sooner or later, we have to say, “Sorry, it’s not reasonable to expect the classification system to model the entire world; that’s a bigger problem”.
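Some of that hidden structure can be recovered with very little code. The sketch below assumes two simplifications: that a compound is just simple numbers joined by a colon, and that truncating a decimal number digit by digit gives its broader classes. Real Lonclass and UDC data will have auxiliaries and exceptions this ignores, and the function names are made up.

```python
def format_notation(digits):
    """Re-insert the conventional dot after every third digit: '6223' -> '622.3'."""
    return ".".join(digits[i:i + 3] for i in range(0, len(digits), 3))


def broader(notation):
    """List the broader classes of a simple decimal number, nearest first."""
    digits = notation.replace(".", "")
    return [format_notation(digits[:n]) for n in range(len(digits) - 1, 0, -1)]


def split_compound(code):
    """Split a colon-joined compound into its component numbers."""
    return code.split(":")


parts = split_compound("622.3:553.9")
for part in parts:
    print(part, "<", " < ".join(broader(part)))
```

Once each component and its ancestors are plain identifiers, they can be mapped one by one to external datasets without ever enumerating the compounds.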

Even within the closed, self-supporting universe of UDC/Lonclass, this compositional semantics system is a very powerful tool for describing obscure topics in terms of well known simpler concepts. But it’s too much for any single organisation (whether the BBC, the UDC Consortium, or anyone) to maintain and extend such a system to cover all of modern life; from social, legal and business developments to new scientific innovations. The work needs to be shared, and RDF is currently our best bet on how to create such work sharing, meaning sharing, information-linking systems in the Web. The hierarchies in UDC and Lonclass don’t attempt to represent all of objective reality; they instead show paths through information.

If the metaphor of a ‘conceptual map’ holds up, then it’s clear that at some point it’s useful to have our maps made by different parties, with different specialised knowledge. The Web now contains a smaller but growing Web of machine readable descriptions. Over at MusicBrainz is a community who take care of describing the entities and relationships that cover much of music, or at least popular music. Others describe countries, species, genetics, languages, historical events, economics, and countless other topics. The data is sometimes messy or an imperfect fit for some task-in-hand, but it is actively growing, curated and connected.

I’m not arguing that Lonclass or UDC should be thrown out and replaced by some vague ‘linked cloud’. Rather, that there are some simple steps that can be taken towards making sure each of these linked datasets contribute to modernising our paths into the archives. We need to document and share opensource tools for an agreed data model for the arcane numeric codes of UDC and Lonclass. We need at least the raw pieces, the simplest codes, to be described for humans and machines in public, stable Web pages, and for their re-use, mapping, data mining and re-combination to be actively encouraged and celebrated. Currently, it is possible to get your hands on this data if you work with the BBC (Lonclass), pay license fees (UDC) or exchange USB sticks with the right party in some shady backstreet. Whether the metaphor of choice is ‘key to the archives’ or ‘conceptual map of…’, this is a deeply unfortunate situation, both for the intrinsic public value of these datasets, but also for the collections they index. There’s a wealth of meaning hidden inside Lonclass and UDC and the collections they index, a lot that can be added by linking it to other RDF datasets, but more importantly there are huge communities out there who’ll do much of the work when the data is finally opened up…

I wrote too much. What I meant to say is simple. Classification systems with compositional semantics can be enriched when we map their basic terms using identifiers from other shared data sets. And those in the UDC/Lonclass tradition, while in some ways they’re showing their age (weird numeric codes, huge monolithic, hard-to-maintain databases), … are also amongst the most interesting systems we have today for navigating information, especially when combined with Linked Data techniques and companion datasets.

Subject classification and Statistics

Subject classification and statistics share some common problems. This post takes a small example discussed at this week’s ODaF event on “Semantic Statistics” in Tilburg, and explores its expression coded in the Universal Decimal Classification (UDC). UDC supports faceted description, providing an abstract grammar allowing sentence-like subject descriptions to be composed from the “raw materials” defined in its vocabulary scheme.

This makes the mapping of UDC (and to some extent also Dewey classifications) into W3C’s SKOS somewhat lossy, since patterns and conventions for documenting these complex, composed structures are not yet well established. In the NoTube project we are looking into this in a TV context, in large part because the BBC archives make extensive use of UDC via their Lonclass scheme; see my ‘investigating Lonclass’ and UDC seminar talk for more on those scenarios. Until this week I hadn’t thought enough about the potential for using this to link deep into statistical datasets.

One of the examples discussed on Tuesday was as follows (via Richard Cyganiak):

“There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008”

There was much interesting discussion in Tilburg about the proper scope and role of Linked Data techniques for sharing this kind of statistical data. Do we use RDF essentially as metadata, to find ‘black boxes’ full of stats, or do we use RDF to try to capture something of what the statistics are telling us about the world? When do we use RDF as simple factual data directly about the world (eg. school X has N pupils [currently; or at time t]), and when does it become a carrier for raw numeric data whose meaning is not so directly expressed at the factual level?

The state of the art in applying RDF here seems to be SDMX-RDF, see Richard’s slides. The SDMX-RDF work uses SKOS to capture code lists, to describe cross-domain concepts and to indicate subject matter.

Given all this, I thought it would be worth taking this tiny example and looking at how it might look in UDC, both as an example of the ‘compositional semantics’ some of us hope to capture in extended SKOS descriptions, but also to explore scenarios that cross-link numeric data with the bibliographic materials that can be found via library classification techniques such as UDC. So I asked the ever-helpful Aida Slavic (editor in chief of the UDC), who talked me through how this example data item looks from a UDC perspective.

I asked,

So I’ve just got home from a meeting on semweb/stats. These folk encode data values with stuff like “There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008”. How much of that could have a UDC coding? I guess I should ask, how would one subject index a book whose main topic was “occupational injuries in the Washington DC metro area in 2008”?

Aida’s reply (posted with permission):

You can present all of it & much more using UDC. When you encode a subject like this in UDC you store much more information than your proposed sentence actually contains. So my decision of how to ‘translate this into udc’ would depend on learning more about the actual text and the context of the message it conveys, implied audience/purpose, the field of expertise for which the information in the document may be relevant etc. I would probably wonder whether this is a research report, study, news article, textbook, radio broadcast?

Not knowing more than you said I can play with the following: 331.46(735.215.2/.4)“2008”

Accidents at work — Washington metropolitan area — year 2008
or a bit more detailed: 331.46-053.18(735.215.2/.4)“2008”
Accidents at work — dead persons – Washington metropolitan area — year 2008
[you can say the number of dead persons but this is not pertinent from point of view of indexing and retrieval]

…or maybe (depending what is in the content and what is the main message of the text) and because you used the expression ‘fatal injuries’ this may imply that this is more health and safety/ prevention area in health hygiene which is in medicine.

The UDC structures composed here are:

TIME “2008”

PLACE (735.215.2/.4)  Counties in the Washington metropolitan area

331     Labour. Employment. Work. Labour economics. Organization of  labour
331.4     Working environment. Workplace design. Occupational safety.  Hygiene at work. Accidents at work
331.46  Accidents at work ==> 614.8

614   Prophylaxis. Public health measures. Preventive treatment
614.8    Accidents. Risks. Hazards. Accident prevention. Personal protection. Safety
614.8.069    Fatal accidents

NB – classification provides a bit more context and is more precise than words when it comes to presenting content i.e. if the content is focused on health and safety regulation and occupation health then the choice of numbers and their order would be different e.g. 614.8.069:331.46-053.18 [relationship between] health & safety policies in prevention of fatal injuries and accidents at work.

So when you read UDC number 331.46 you do not see only e.g. ‘accidents at work’ but ==> ‘accidents at work < occupational health/safety < labour economics, labour organization < economy’
and when you see UDC number 614.8 it is not only fatal accidents but rather ==> ‘fatal accidents < accident prevention, safety, hazards < Public health and hygiene. Accident prevention’

When you see (735.2….) you do not only see Washington but also United States, North America
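The structure Aida describes can be pulled apart mechanically, at least for codes shaped like the two in her reply. The sketch below is a narrow pattern for “main number, optional hyphen auxiliary, optional place in parentheses, optional time in double quotes”; it is nowhere near a full UDC grammar, the field names are my own, and it uses straight double quotes where the text above shows typographic ones.

```python
import re

# Covers only: main number, optional -auxiliary, optional (place), optional "time".
PATTERN = re.compile(
    r'(?P<main>\d[\d.]*)'      # main number, e.g. 331.46
    r'(?P<aux>-\d[\d.]*)?'     # hyphen auxiliary, e.g. -053.18 (dead persons)
    r'(?P<place>\([^)]*\))?'   # place auxiliary, e.g. (735.215.2/.4)
    r'(?P<time>"[^"]*")?$'     # time auxiliary, e.g. "2008"
)


def parse_udc(code):
    """Split a simple UDC code into its facets; raise ValueError if it does not fit."""
    match = PATTERN.match(code)
    if match is None:
        raise ValueError(f"unsupported code shape: {code!r}")
    return {name: value for name, value in match.groupdict().items() if value is not None}


detailed = parse_udc('331.46-053.18(735.215.2/.4)"2008"')
simple = parse_udc('331.46(735.215.2/.4)"2008"')
print(detailed)
print(simple)
```

Each facet can then be looked up in its own hierarchy: the main number under labour economics, the place under United States and North America, and so on.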

So why is this interesting? A couple of reasons…

1. Each of these complex codes combines several different hierarchically organized components; just as they can be used to explore bibliographic materials, similar approaches might be of value for navigating the growing collections of public statistical data. If SKOS is to be extended / improved to better support subject classification structures, we should take care also to consider use cases from the world of statistics and numeric data sharing.

2. Multilingual aspects. There are plans to expose SKOS data for the upper levels of UDC. An HTML interface to this “UDC summary” is already available online, and includes collected translations of textual labels in many languages (see progress report). For example, we can look up 331.4 and find (in hierarchical context) definitions in English (“Working environment. Workplace design. Occupational safety. Hygiene at work. Accidents at work”), alongside e.g. Spanish (“Entorno del trabajo. Diseño del lugar de trabajo. Seguridad laboral. Higiene laboral. Accidentes de trabajo”), Croatian, Armenian, …

Linked Data is about sharing work; if someone else has gone to the trouble of making such translations, it is probably worth exploring ways of re-using them. Numeric data is (in theory) linguistically neutral; this should make linking to translations particularly attractive. Much of the work around RDF and stats is about providing sufficient context to the raw values to help us understand what is really meant by “66” in some particular dataset. By exploiting SDMX-RDF’s use of SKOS, it should be possible to go further and to link out to the wider literature on workplace fatalities. This kind of topical linking should work in both directions: exploring out from numeric data to related research, debate and findings, but also coming in and finding relevant datasets that are cross-referenced from books, articles and working papers. W3C recently launched a Library Linked Data group; I look forward to learning more about how libraries are thinking about connecting numeric and non-numeric information.

Skosdex progress: basic lucene search

I now have a crude Lucene index derived from SKOS data. It is more or less a toy example, but somehow promising also.

The example below is a test against FAO’s AGROVOC. Each concept becomes a “document”, with a “word” field containing the prefLabel, and a “uri” field for the concept URI. I don’t index anything else yet.
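To make the "one document per concept" idea concrete, here is a minimal sketch in plain Ruby. It is not Lucene and not the Skosdex code: the ConceptIndex class, the sample_concept_index helper and the example.org URIs are all made up for illustration, and matching is a simple "all tokens present" test rather than a real phrase query.

```ruby
# Toy stand-in for the index described above: one "document" per concept,
# with a "word" field (the prefLabel) and a "uri" field.
# ConceptIndex and the example.org URIs are hypothetical.
class ConceptIndex
  def initialize
    @docs = []                               # stored documents
    @terms = Hash.new { |h, k| h[k] = [] }   # token -> document ids
  end

  # Add one concept; the label is lowercased and split on whitespace.
  def add(uri, pref_label)
    id = @docs.length
    @docs << { "uri" => uri, "word" => pref_label }
    pref_label.downcase.split.uniq.each { |tok| @terms[tok] << id }
    id
  end

  # Return documents whose "word" field contains every token of the query.
  def search_word(query)
    tokens = query.downcase.split
    return [] if tokens.empty?
    ids = tokens.map { |tok| @terms.fetch(tok, []) }.reduce(:&)
    ids.map { |id| @docs[id] }
  end
end

# Build a two-document index using labels from the transcript.
def sample_concept_index
  index = ConceptIndex.new
  index.add("http://example.org/concept/1", "Pteria hirundo")
  index.add("http://example.org/concept/2", "Leiocottus hirundo")
  index
end

sample_concept_index.search_word("hirundo").each do |doc|
  puts "word:#{doc['word']} uri:#{doc['uri']}"
end
```

Adding altLabel would probably just mean feeding more tokens into the same document, which is one reason a throwaway model like this is handy for thinking about indexing regimes.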

The hope here is to have a handy prototyping environment for testing different indexing regimes. The code takes about 4-5 mins to index AGROVOC on my MacBook, running under JRuby.

The data I’m using is a SKOS dump from the FAO Web site, post-processed with “grep -v” to skip the Farsi lines, due to a Unicode error. The transcript below comes from running Lucli, a handy command line tool for Lucene.

Next steps with indexing? Not sure. Probably make sure altLabel is handled. But I’m also curious about the possibility of including fields that pull in labels from nearby concepts, so they can be matched in weighted searches. It would be hard to evaluate the effectiveness though.

lucli> search uri:”″
Searching for: uri:”http aims aos agrovoc c_47934″
1 total matching documents
—————- 1 score:1.0———————
word:Pteria hirundo
lucli> search word:”Leiocottus hirundo”
Searching for: word:”leiocottus hirundo”
1 total matching documents
—————- 1 score:1.0———————
word:Leiocottus hirundo
lucli> search word:”hirundo”
Searching for: word:hirundo
2 total matching documents
—————- 1 score:1.0———————
word:Pteria hirundo
—————- 2 score:1.0———————
word:Leiocottus hirundo

Skosdex: SKOS utilities via jruby

I just announced this on the public-esw-thes and public-rdf-ruby lists. I started to make a Ruby API for SKOS.

Example code snippet from the readme.txt (see that link for the corresponding output):

require "src/jena_skos"
s1 ="")"" )"file:samples/archives.rdf")
s1.concepts.each_pair do |url,c|
  puts "SKOS: #{url} label: #{c.prefLabel}"
end

c1 = s1.concepts[""] # Agronomy
puts "test concept is "+ c1 + " " + c1.prefLabel
c1.narrower do |uri|
  c2 = s1.concepts[uri]
  puts "\tnarrower: "+ c2 + " " + c2.prefLabel
  c2.narrower do |uri|
    c3 = s1.concepts[uri]
    puts "\t\tnarrower: "+ c3 + " " + c3.prefLabel
  end
end

The idea here is to have a lightweight OO API for SKOS, couched in terms of a network of linked “Concepts”, with broader and narrower relations. But this is backed by a full RDF API (in our case Jena, via Java jruby magic). Eventually, entire apps could be built at the SKOS API level. For now, anything beyond broader/narrower and prefLabel is hidden away in the RDF (and so you’d need to dip into the Jena API to get to this data).
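For a feel of what "a network of linked Concepts" means without any RDF machinery behind it, here is a rough pure-Ruby sketch. It is not the jena_skos API: ToyConcept, ToyScheme, sample_scheme, the ex: identifiers and the two narrower labels are hypothetical names chosen only for this illustration.

```ruby
# Hypothetical pure-Ruby sketch of a concept network; not the jena_skos API.
class ToyConcept
  attr_reader :uri, :pref_label, :broader, :narrower

  def initialize(uri, pref_label)
    @uri = uri
    @pref_label = pref_label
    @broader = []    # URIs of broader concepts
    @narrower = []   # URIs of narrower concepts
  end
end

class ToyScheme
  attr_reader :concepts

  def initialize
    @concepts = {}   # uri -> ToyConcept
  end

  def add(uri, pref_label)
    @concepts[uri] ||= ToyConcept.new(uri, pref_label)
  end

  # Record both directions of the link, since broader and narrower are inverses.
  def link(narrower_uri, broader_uri)
    @concepts.fetch(narrower_uri).broader << broader_uri
    @concepts.fetch(broader_uri).narrower << narrower_uri
  end

  # Walk the narrower links depth-first, returning [depth, label] pairs.
  # The seen hash guards against cycles in messy data.
  def narrower_tree(uri, depth = 0, seen = {})
    return [] if seen[uri]
    seen[uri] = true
    concept = @concepts.fetch(uri)
    rows = [[depth, concept.pref_label]]
    concept.narrower.each { |n| rows.concat(narrower_tree(n, depth + 1, seen)) }
    rows
  end
end

# Made-up three-level example; only "Agronomy" echoes the snippet above.
def sample_scheme
  s = ToyScheme.new
  s.add("ex:agronomy", "Agronomy")
  s.add("ex:soil", "Soil science")
  s.add("ex:erosion", "Soil erosion")
  s.link("ex:soil", "ex:agronomy")
  s.link("ex:erosion", "ex:soil")
  s
end

sample_scheme.narrower_tree("ex:agronomy").each do |depth, label|
  puts("\t" * depth + label)
end
```

The recursive walk replaces the hand-nested narrower loops, at the cost of hiding how deep the hierarchy actually goes.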

The distinguishing feature is that it uses jruby (a Ruby implementation in pure Java). As such it can call on the full powers of the Jena toolkit, which go far beyond anything available currently in Ruby. At the moment it doesn’t do much: I just parse SKOS and make a tiny object model which exposes little more than prefLabel and broader/narrower.

I think it’s worth exploring because Ruby is rather nice for scripting, but lacks things like OWL reasoners and the general maturity of Java RDF/OWL tools (parsers, databases, etc.).

If you’re interested just to see how Jena APIs look when called from jruby, see jena_skos.rb in svn. Excuse the mess.

I’m interested to hear if anyone else has explored this topic. Obviously there is a lot more to SKOS than broader/narrower, so I’m very interested to find collaborators or at least a sanity check before taking this beyond a rough demo.

Plans – well my main concern is nothing to do with java or ruby, … but to explore Lucene indexing of SKOS data. I am also very interested in the pragmatic question of where SKOS stops and RDFS/OWL starts, … and how exactly we bridge that gap. See flickr for my most recent sketch of this landscape, where I revisit the idea of an “it” property (skos:it, foaf:it, …) that links things described in SKOS to “the thing itself”. I hope to load up enough overlapping SKOS data to get some practical experience with the tradeoffs.

The Lucene work is aimed at query expansion, smarter tagging assistants, etc. So the next step is probably to try building a Lucene index similar to the contrib/wordnet utility that ships with Java Lucene. This creates a Lucene index in which every “document” is really a word from Wordnet, with text labels for its synonyms as indexed properties. I also hope to look at the use of SKOS + Lucene for “did you mean?” and auto-completion utilities. It’s also worth noting that Jena ships with LARQ, a Lucene-aware extension to ARQ, Jena’s SPARQL engine.
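As a rough illustration of what query expansion and auto-completion over labels might look like, here is a small sketch that uses a plain Ruby hash in place of a Lucene index. The TOY_LABELS table and the expand_term and complete helpers are invented for this example and do not come from Lucene, LARQ or Skosdex.

```ruby
# Hypothetical label table: preferred label -> alternative labels.
TOY_LABELS = {
  "maize" => ["corn", "zea mays"],
  "cattle" => ["cows"]
}.freeze

# Expand a query term into every label that shares a concept with it.
def expand_term(term, labels = TOY_LABELS)
  t = term.downcase
  labels.each do |pref, alts|
    group = [pref] + alts
    return group if group.include?(t)
  end
  [t]   # unknown terms pass through unchanged
end

# Prefix match over all known labels, for a crude auto-completion.
def complete(prefix, labels = TOY_LABELS)
  all = labels.flat_map { |pref, alts| [pref] + alts }
  all.select { |label| label.start_with?(prefix.downcase) }.sort
end

puts expand_term("Corn").join(" OR ")
puts complete("c").inspect
```

A real index would also want scoring, so that a match on a preferred label outranks a match on an alternative one.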

Family trees, Gedcom::FOAF in CPAN, and provenance

Ever wondered who the mother(s) of Adam and Eve’s grand-children were? Me too. But don’t expect SPARQL or the Semantic Web to answer that one! Meanwhile, …

You might nevertheless care to try the Gedcom::FOAF CPAN module from Brian Cassidy. It can read Gedcom, a popular ‘family history’ file format, and turn it into RDF (using FOAF and the relationship and biography vocabularies). A handy tool that can open up a lot of data to SPARQL querying.

The Gedcom::FOAF API seems to focus on turning the people or family Gedcom entries into their own FOAF XML files. I wrote a quick and horrid Perl script that runs over a Gedcom file and emits a single flattened RDF/XML document. While URIs for non-existent XML files are generated, this isn’t a huge problem.

Perhaps someone would care to take a look at this code and see whether a more RDFa- and linked-data-oriented script would be useful?

Usage: perl BUELL001.GED > _sample_gedfoaf.rdf

The sample data I tested it on is intriguing, though I’ve not really looked around it yet.

It contains over 9800 people including the complete royal lines of England, France, Spain and the partial royal lines of almost all other European countries. It also includes 19 United States Presidents descended from royalty, including Washington, both Roosevelts, Bush, Jefferson, Nixon and others. It also has such famous people as Brigham Young, William Bradford, Napoleon Bonaparte, Winston Churchill, Anne Bradstreet (Dudley), Jesus Christ, Daniel Boone, King Arthur, Jefferson Davis, Brian Boru King of Ireland, and others. It goes all the way back to Adam and Eve and also includes lines to ancient Rome including Constantine the Great and ancient Egypt including King Tutankhamen (Tut).

The data is credited to Matt & Ellie Buell, “Uploaded By: Eochaid”, 1995-05-25.

Here’s an extract to give an idea of the Gedcom form:

0 @I4961@ INDI
1 NAME Adam //
1 REFN +
2 DATE ABT 4000 BC
2 PLAC Eden
2 DATE ABT 3070 BC
1 FAMS @F2398@
1 NOTE He was the first human on Earth.
1 SOUR Genesis 2:20 KJV
0 @I4962@ INDI
1 NAME Eve //
1 REFN +
2 DATE ABT 4000 BC
2 PLAC Eden
1 FAMS @F2398@
1 SOUR Genesis 3:20 KJV

It might not directly answer the great questions of biblical scholarship, but it could be a fun dataset to explore Gedcom / RDF mappings with. I wonder how it compares with Freebase, DBpedia etc.
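To show the general shape of a Gedcom-to-flat-statements pass, here is a minimal sketch in Ruby. It is not the Perl script mentioned above and not Gedcom::FOAF: the gedcom_to_rows helper is hypothetical, it only keeps level-1 tags that carry a value, and it emits plain [subject, tag, value] rows rather than real RDF. The sample is a subset of the extract above.

```ruby
# Minimal Gedcom reader: groups lines under their level-0 record and emits
# flat [subject, tag, value] rows. Only level-1 tags with a value are kept;
# deeper levels (e.g. the DATE and PLAC lines) are ignored in this sketch.
def gedcom_to_rows(text)
  rows = []
  current = nil
  text.each_line do |line|
    level, rest = line.strip.split(" ", 2)
    next if rest.nil?
    if level == "0"
      xref, type = rest.split(" ", 2)
      # Record lines look like "@I4961@ INDI"; skip others such as HEAD.
      current = xref.start_with?("@") ? xref : nil
      rows << [current, "TYPE", type] if current
    elsif level == "1" && current
      tag, value = rest.split(" ", 2)
      rows << [current, tag, value] if value
    end
  end
  rows
end

SAMPLE_GEDCOM = <<~GED
  0 @I4961@ INDI
  1 NAME Adam //
  1 FAMS @F2398@
  1 SOUR Genesis 2:20 KJV
  0 @I4962@ INDI
  1 NAME Eve //
  1 FAMS @F2398@
  1 SOUR Genesis 3:20 KJV
GED

gedcom_to_rows(SAMPLE_GEDCOM).each { |row| puts row.join(" | ") }
```

Note that the SOUR values survive as plain strings here, which is roughly where the provenance question starts.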

The Perl module is a good start for experimentation but it only really scratches the surface of the problem of representing source/provenance and uncertainty. On which topic, Jeni Tennison has a post from a year ago that’s well worth (re-)reading.

What I’ve done in the above little Perl script is implement a simplification: instead of each family description being its own separate XML file, they are all squashed into a big flat set of triples (‘graph’). This may or may not be appropriate, depending on the sourcing of the records. It seems Gedcom offers some basic notion of ‘source’, although not one expressed in terms of URIs. If I look in the SOUR(ce) field in the Gedcom file, I see information like this (which currently seems to be ignored in the Gedcom::FOAF mapping):

grep SOUR BUELL001.GED | sort | uniq

1 NOTE !SOURCE:Burford Genealogy, Page 102 Cause of Death; Hemorrage of brain
1 NOTE !SOURCE:Gertrude Miller letter “Harvey Lee lived almost 1 year. He weighed
1 NOTE !SOURCE:Gertrude Miller letter “Lynn died of a ruptured appendix.”
1 NOTE !SOURCE:Gertrude Miller letter “Vivian died of a tubal pregnancy.”
1 SOUR “Castles” Game Manuel by Interplay Productions
1 SOUR “Mayflower Descendants and Their Marriages” pub in 1922 by Bureau of
1 SOUR “Prominent Families of North Jutland” Pub. in Logstor, Denmark. About 1950
1 SOUR /*- TUT
1 SOUR 273
1 SOUR AHamlin777.  E-Mail “Descendents of some guy
1 SOUR Blundell, Sherrie Lea (Slingerland).  information provided on 16 Apr 1995
1 SOUR Blundell, William, Rev. Interview on Jan 29, 1995.
1 SOUR Bogert, Theodore. AOL user “TedLBJ” File uploaded to American Online
1 SOUR Buell, Barbara Jo (Slingerland)
1 SOUR Buell, Beverly Anne (Wenge)
1 SOUR Buell, Beverly Anne (Wenge).  letter addressed to Kim & Barb Buell dated
1 SOUR Buell, Kimberly James.
1 SOUR Buell, Matthew James. written December 19, 1994.
1 SOUR Burnham, Crystal (Harris).  Leter sent to Matt J. Buell on Mar 18, 1995.
1 SOUR Burnham, Crystal Colleen (Harris).  AOL user CBURN1127.  E-mail “Re: [...etc.]

Some of these sources could be tied to cleaner IDs (eg. for books c/o Open Library, although see ‘in search of cultural identifiers‘ from Michael Smethurst).

I believe RDF’s SPARQL language gives us a useful tool (the notion of ‘GRAPH’) that can be applied here, but we’re a long way from having worked out the details when it comes to attaching evidence to claims. So for now, we in the RDF scene have a fairly coarse-grained approach to data provenance. Databases are organized into batches of triples, ie. RDF statements that claim something about the world. And while we can use these batches – aka graphs – in our queries, we haven’t really figured out what kind of information we want to associate with them yet. Which is a pity, since this could have uses well beyond family history, for example to online journalistic practices and blog-mediated fact checking.
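The "batches of triples" idea can be sketched in a few lines of Ruby, treating each statement as a quad whose fourth slot names the graph it came from. This is only an illustration under my own assumptions: the Quad struct, ToyQuadStore, the ex: and g: identifiers and the second, conflicting date are all hypothetical, and a real system would use a SPARQL store rather than an array.

```ruby
# Hypothetical quad store: each statement carries the name of the graph
# (batch) it came from, and each graph has its own free-form description.
Quad = Struct.new(:subject, :predicate, :object, :graph)

class ToyQuadStore
  def initialize
    @quads = []
    @graph_info = {}   # graph name -> note about where the batch came from
  end

  def describe_graph(name, info)
    @graph_info[name] = info
  end

  def add(subject, predicate, object, graph)
    @quads << Quad.new(subject, predicate, object, graph)
  end

  # Find matching statements together with what we know about their source.
  def claims(subject, predicate)
    @quads.select { |q| q.subject == subject && q.predicate == predicate }
          .map { |q| [q.object, @graph_info[q.graph]] }
  end
end

# Two batches making different claims about the same thing.
def sample_store
  store = ToyQuadStore.new
  store.describe_graph("g:buell", "BUELL001.GED, credited to Matt & Ellie Buell")
  store.describe_graph("g:other", "another family tree (made up for this example)")
  store.add("ex:adam", "ex:born", "ABT 4000 BC", "g:buell")
  store.add("ex:adam", "ex:born", "ABT 5500 BC", "g:other")   # invented value
  store
end

sample_store.claims("ex:adam", "ex:born").each do |value, source|
  puts "#{value} (according to: #{source})"
end
```

What this leaves open is exactly the hard part: what the per-graph description should contain, and in what vocabulary.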

Nearby in the Web: see also the SIOC/SWAN telecons, a collaboration in the W3C SemWeb lifescience community around the topic of modelling scientific discourse.

Cross-browsing and RDF

While cross-searching has been described and demonstrated through this paper and associated work, the problem of cross-browsing a selection of subject gateways has not been addressed. Many gateway users prefer to browse, rather than search. Though browsing usually takes longer than searching, it can be more thorough, as it is not dependent on the user’s terms matching keywords in resource descriptions (even when a thesaurus is used, it is possible for resources to be “missed” if they are not described in great detail).

As a “quick fix”, a group of gateways may create a higher level menu that points to the various browsable menus amongst the gateways. However, this would not be a truly hierarchical menu system, as some gateways maintain browsable resource menus in the same atomic (or lowest level) subject area. One method of enabling cross-browsing is by the use of RDF.

The World Wide Web Consortium has recently published a preliminary draft specification for the Resource Description Framework (RDF). RDF is intended to provide a common framework for the exchange of machine-understandable information on the Web. The specification provides an abstract model for representing arbitrarily complex statements about networked resources, as well as a concrete XML-based syntax for representing these statements in textual form. RDF relies heavily on the notion of standard vocabularies, and work is in progress on a ‘schema’ mechanism that will allow user communities to express their own vocabularies and classification schemes within the RDF model.

RDF’s main contribution may be in the area of cross-browsing rather than cross-searching, which is the focus of the CIP. RDF promises to deliver a much-needed standard mechanism that will support cross-service browsing of highly-organised resources. There are many networked services available which have classified their resources using formal systems like MeSH or UDC. If these services were to each make an RDF description of their collection available, it would be possible to build hierarchical ‘views’ of the distributed services offering a user interface organised by subject-classification rather than by physical location of the resource.

From Cross-Searching Subject Gateways, The Query Routing and Forward Knowledge Approach, Kirriemuir et al., D-Lib Magazine, January 1998.

I wrote this over 11 (eleven) years ago, as something of an aside during a larger paper on metadata for distributed search. While we are making progress towards such goals, especially with regard to cross-referenced descriptions of identifiable things (ie. the advances made through linked data techniques lately), the pace of progress can be quite frustrating. Just as it seems like we’re making progress, things take a step backwards. For example, the wonderful site is currently offline while the relevant teams at the Library of Congress figure out how best to proceed. It’s also ten years since Charlotte Jenkins published some great work on auto-classification that used OCLC’s Dewey Decimal Classification. That work also ran into problems, since DDC wasn’t freely available for use in such applications. In the current climate, with Creative Commons, Open source, Web 2.0 and suchlike all the rage, I hope we’ll finally see more thesaurus and classification systems opened up (eg. with SKOS) and fully linked into the Web. Maybe by 2019 the Web really will be properly cross-referenced…

SKOS deployment stats from Sindice

This cropped up in yesterday’s W3C Semantic Web Coordination Group telecon, as we discussed the various measures of SKOS deployment success.

I suggested drawing a distinction between the use of SKOS to publish thesauri (ie. SKOS schemes), and the use of SKOS in RDFS/OWL schemas, for example subclassing of skos:Concept or defining properties whose range or domain are skos:Concept. A full treatment would look for a variety of constructs (eg. new properties that declare themselves subPropertyOf something in SKOS).

An example of such a use of SKOS is the new sioc:Category class, recently added to the SIOC namespace.

Here are some quick experiments with Sindice.

Search results for advanced “* <> <>”, found 10

Search results for advanced “* <> <>”, found 10

Search results for advanced “* <> <>”, found 18

Here’s a query that finds all mentions of skos:Concept in an object role within an RDF statement:

Search results for advanced “* * <>”, found about 432.32 thousand

This all seems quite healthy, although I’ve not clicked through to explore many of these results yet.
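The queries above are essentially triple patterns with wildcards. A toy version of that matching, in Ruby, makes the distinction between "publishing concepts" and "using SKOS in a schema" a bit more tangible. Nothing here is Sindice's API: match_pattern and the TOY_TRIPLES data (including the ex: names) are hypothetical.

```ruby
# Match [subject, predicate, object] triples against a pattern in which
# "*" is a wildcard, loosely imitating the advanced queries above.
def match_pattern(triples, pattern)
  triples.select do |triple|
    triple.zip(pattern).all? { |value, want| want == "*" || want == value }
  end
end

# Made-up data: two schema-level uses of skos:Concept, one instance-level.
TOY_TRIPLES = [
  ["ex:Topic", "rdfs:subClassOf", "skos:Concept"],
  ["ex:about", "rdfs:range", "skos:Concept"],
  ["ex:cats", "rdf:type", "skos:Concept"],
  ["ex:cats", "skos:prefLabel", "Cats"]
].freeze

# Every mention of skos:Concept in the object role.
puts match_pattern(TOY_TRIPLES, ["*", "*", "skos:Concept"]).length

# Only the schema-level subclassing.
match_pattern(TOY_TRIPLES, ["*", "rdfs:subClassOf", "skos:Concept"]).each do |s, _p, _o|
  puts "#{s} is declared a subclass of skos:Concept"
end
```

The first pattern lumps thesaurus publishing and schema use together; the narrower patterns are what separate them.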

BTW I also tried using the proposed (but retracted – see CR request notes) new SKOS namespace, (unless I’m mistaken). I couldn’t find any data in Sindice yet that was using this namespace.