Lonclass is one of the BBC’s in-house classification systems – the “London classification”. I’ve had the privilege of investigating Lonclass within the NoTube project. It’s not currently public, but much of what I say here is also applicable to the Universal Decimal Classification (UDC) system upon which it was based. UDC is also not fully public yet; I’ve made a case elsewhere that it should be, and I hope we’ll see that within my lifetime. UDC and Lonclass have a fascinating history and are rich cultural heritage artifacts in their own right, but I’m concerned here only with their role as the keys to many of our digital and real-world archives.
Why would we want to map Lonclass or UDC subject classification codes into RDF?
Lonclass codes can be thought of as compact but potentially complex sentences, built from the thousands of base ‘words’ in the Lonclass dictionary. By mapping the basic pieces, the words, to other data sources, we also enrich the compound sentences. We can’t map all of the sentences as there can be infinitely many of them – it would be an expensive and never-ending task.
For example, we might have a Lonclass code for “Report on the environmental impact of the decline of tin mining in Sweden in the 20th century”. This would be a jumble of numbers and punctuation which I won’t trouble you with here. But if we parse out that structure, we can see the complex code as built from primitives such as ‘tin mining’ (itself e.g. ‘Tin’ and ‘Mining’), ‘Sweden’, etc. By linking those identifiable parts to shared Web data, we also learn more about the complex composite codes that use them. Wikipedia’s Sweden entry tells us in English, “Sweden has land borders with Norway to the west and Finland to the northeast, and water borders with Denmark, Germany, and Poland to the south, and Estonia, Latvia, Lithuania, and Russia to the east.” Increasingly this additional information is available in machine-friendly form. Right now we can’t learn about Sweden’s borders from the bits of Wikipedia reflected into DBpedia’s Sweden entry, but UN FAO’s geopolitical ontology does have this information and more in RDF form.
There is more, much more, to know about Sweden than can possibly be represented directly within Lonclass or UDC. Yet those facts may also be very useful for the retrieval of information tagged with Sweden-related Lonclass codes. If we map the Lonclass notion of ‘Sweden’ to identified concepts described elsewhere, then whenever we learn more about the latter, we also learn more about the former, and indirectly, about anything tagged with complex Lonclass codes using that concept. Suddenly an archived TV documentary tagged as covering a ‘report on the environmental impact of the decline of tin mining in Sweden’ is also accessible to people or machines looking under Scandinavia + metal mining. Environmental matters, after all, often don’t respect geo-political borders; someone searching for coverage of environmental trends in a neighbouring country might well be happy to find this documentary. But should Lonclass or UDC maintain an index of which countries border which others? Surely not!
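To make the idea concrete, here is a rough Python sketch of that kind of lookup. Everything in it is made up for illustration: the `term:` and `ext:` identifiers are hypothetical stand-ins (not real Lonclass codes or real dataset URIs), the archive holds a single invented item, and the border data is just the land borders from the Wikipedia sentence quoted earlier. A real system would pull the external facts from an RDF source rather than a hard-coded dictionary.

```python
# Hypothetical mapping from base classification terms to concepts
# described in some external dataset. None of these are real identifiers.
TERM_TO_CONCEPT = {
    "term:sweden": "ext:Sweden",
    "term:norway": "ext:Norway",
    "term:finland": "ext:Finland",
}

# Facts maintained by someone else, outside the classification scheme.
# Toy subset: only Sweden's land borders, stated in both directions.
BORDERS = {
    "ext:Sweden": {"ext:Norway", "ext:Finland"},
    "ext:Norway": {"ext:Sweden"},
    "ext:Finland": {"ext:Sweden"},
}

# An invented archive item, tagged with the base terms of its compound code.
ARCHIVE = {
    "doc-1": {"term:sweden", "term:tin", "term:mining", "term:environment"},
}


def expand_place(term):
    """Return the place term plus terms for any bordering countries."""
    concept = TERM_TO_CONCEPT.get(term)
    if concept is None:
        # Unmapped term: nothing external to learn from, so no expansion.
        return {term}
    concept_to_term = {c: t for t, c in TERM_TO_CONCEPT.items()}
    neighbours = BORDERS.get(concept, set())
    return {term} | {concept_to_term[n] for n in neighbours if n in concept_to_term}


def search(topic_term, place_term):
    """Find items on a topic in a place or in a neighbouring country."""
    places = expand_place(place_term)
    return sorted(
        item
        for item, tags in ARCHIVE.items()
        if topic_term in tags and tags & places
    )


if __name__ == "__main__":
    # Someone looking for environmental coverage around Norway
    # also finds the Swedish tin mining documentary.
    print(search("term:environment", "term:norway"))
```

The point of the sketch is that the border facts live outside the classification data: the scheme only has to say which of its terms corresponds to which external concept.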
Lonclass and UDC codes have a rich hidden structure that is rarely exploited with modern tools. Lonclass, by virtue of its UDC heritage, does a lot of work itself towards representing complex conceptual inter-relationships. It embodies a conceptual map of our world, with mysterious codes (well known in the library world) for topics such as ‘622 – mining’, but also specifics e.g. ‘622.3 Mining of specific minerals, ores, rocks’, and combinations (‘622.3:553.9 Extraction of carbonaceous minerals, hydrocarbons’). By joining a code for ‘mining a specific mineral…’ to a code for ‘553.9 Deposits of carbonaceous rocks. Hydrocarbon deposits’ we get a compound term. So Lonclass/UDC “knows” about the relationship between “Tin Mining” and “Mining”, “metals” etc., and quite likely between “Sweden” and “Scandinavia”. But it can’t know everything! Sooner or later, we have to say, “Sorry, it’s not reasonable to expect the classification system to model the entire world; that’s a bigger problem”.
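As a minimal sketch of how that hidden structure might be unpacked, the Python below splits a colon-joined compound such as ‘622.3:553.9’ into its parts and lists the broader codes implied by each part’s digits. It assumes only the simplest case: plain digits-and-dots notation joined by colons, with the dot treated as a readability separator after every third digit. Real UDC and Lonclass codes use much more punctuation than this, and the function names and the tiny label table (filled in from the examples above) are my own, not part of any existing tool.

```python
# Labels taken from the examples in the text; a real table would be far larger.
LABELS = {
    "622": "Mining",
    "622.3": "Mining of specific minerals, ores, rocks",
    "553.9": "Deposits of carbonaceous rocks. Hydrocarbon deposits",
}


def split_compound(code):
    """Split a colon-joined compound code into its component codes."""
    return [part.strip() for part in code.split(":") if part.strip()]


def format_code(digits):
    """Re-insert the readability dot after every third digit."""
    groups = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    return ".".join(groups)


def broader_codes(code):
    """List the broader codes implied by a simple numeric code.

    Assumes the hierarchy follows digit prefixes, so 622.3 sits
    under 622, which sits under 62, which sits under 6.
    """
    digits = code.replace(".", "")
    return [format_code(digits[:i]) for i in range(1, len(digits))]


def describe(compound):
    """Describe each part of a compound: its label and its broader codes."""
    return [
        {
            "code": part,
            "label": LABELS.get(part, "(no label in this toy table)"),
            "broader": broader_codes(part),
        }
        for part in split_compound(compound)
    ]


if __name__ == "__main__":
    for entry in describe("622.3:553.9"):
        print(entry["code"], "|", entry["label"], "|", entry["broader"])
```

Once the parts are separated like this, each simple code becomes something that can be given a stable identifier and mapped to other datasets, while the compound codes inherit whatever is learned about their parts.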
Even within the closed, self-supporting universe of UDC/Lonclass, this compositional semantics system is a very powerful tool for describing obscure topics in terms of well known simpler concepts. But it’s too much for any single organisation (whether the BBC, the UDC Consortium, or anyone) to maintain and extend such a system to cover all of modern life; from social, legal and business developments to new scientific innovations. The work needs to be shared, and RDF is currently our best bet on how to create such work sharing, meaning sharing, information-linking systems in the Web. The hierarchies in UDC and Lonclass don’t attempt to represent all of objective reality; they instead show paths through information.
If the metaphor of a ‘conceptual map’ holds up, then it’s clear that at some point it’s useful to have our maps made by different parties, with different specialised knowledge. The Web now contains a smaller but growing Web of machine readable descriptions. Over at MusicBrainz is a community who take care of describing the entities and relationships that cover much of music, or at least popular music. Others describe countries, species, genetics, languages, historical events, economics, and countless other topics. The data is sometimes messy or an imperfect fit for some task-in-hand, but it is actively growing, curated and connected.
I’m not arguing that Lonclass or UDC should be thrown out and replaced by some vague ‘linked cloud’. Rather, that there are some simple steps that can be taken towards making sure each of these linked datasets contributes to modernising our paths into the archives. We need to document and share open source tools for an agreed data model for the arcane numeric codes of UDC and Lonclass. We need at least the raw pieces, the simplest codes, to be described for humans and machines in public, stable Web pages, and for their re-use, mapping, data mining and re-combination to be actively encouraged and celebrated. Currently, it is possible to get your hands on this data if you work with the BBC (Lonclass), pay license fees (UDC) or exchange USB sticks with the right party in some shady backstreet. Whether the metaphor of choice is ‘key to the archives’ or ‘conceptual map of…’, this is a deeply unfortunate situation, both for the intrinsic public value of these datasets and for the collections they index. There’s a wealth of meaning hidden inside Lonclass and UDC and the collections they index, and a lot that can be added by linking it to other RDF datasets, but more importantly there are huge communities out there who’ll do much of the work when the data is finally opened up…
I wrote too much. What I meant to say is simple. Classification systems with compositional semantics can be enriched when we map their basic terms using identifiers from other shared data sets. And those in the UDC/Lonclass tradition, while in some ways showing their age (weird numeric codes; huge, monolithic, hard-to-maintain databases), are also amongst the most interesting systems we have today for navigating information, especially when combined with Linked Data techniques and companion datasets.