Subject classification and statistics

Subject classification and statistics share some common problems. This post takes a small example discussed at this week’s ODaF event on “Semantic Statistics” in Tilburg, and explores how it might be expressed in the Universal Decimal Classification (UDC). UDC supports faceted description, providing an abstract grammar that allows sentence-like subject descriptions to be composed from the “raw materials” defined in its vocabulary scheme.

This makes the mapping of UDC (and to some extent also Dewey classifications) into W3C’s SKOS somewhat lossy, since patterns and conventions for documenting these complex, composed structures are not yet well established. In the NoTube project we are looking into this in a TV context, in large part because the BBC archives make extensive use of UDC via their Lonclass scheme; see my ‘investigating Lonclass‘ post and UDC seminar talk for more on those scenarios. Until this week I hadn’t thought enough about the potential for using this to link deep into statistical datasets.

One of the examples discussed on Tuesday was as follows (via Richard Cyganiak):

“There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008”

There was much interesting discussion in Tilburg about the proper scope and role of Linked Data techniques for sharing this kind of statistical data. Do we use RDF essentially as metadata, to find ‘black boxes’ full of stats, or do we use RDF to try to capture something of what the statistics are telling us about the world? When do we use RDF as simple factual data directly about the world (eg. school X has N pupils [currently; or at time t]), and when does it become a carrier for raw numeric data whose meaning is not so directly expressed at the factual level?

The state of the art in applying RDF here seems to be SDMX-RDF, see Richard’s slides. The SDMX-RDF work uses SKOS to capture code lists, to describe cross-domain concepts and to indicate subject matter.

Given all this, I thought it would be worth taking this tiny example and looking at how it might look in UDC, both as an example of the ‘compositional semantics’ some of us hope to capture in extended SKOS descriptions, but also to explore scenarios that cross-link numeric data with the bibliographic materials that can be found via library classification techniques such as UDC. So I asked the ever-helpful Aida Slavic (editor in chief of the UDC), who talked me through how this example data item looks from a UDC perspective.

I asked,

So I’ve just got home from a meeting on semweb/stats. These folk encode data values with stuff like “There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008”. How much of that could have a UDC coding? I guess I should ask: how would you subject-index a book whose main topic was “occupational injuries in the Washington DC metro area in 2008”?

Aida’s reply (posted with permission):

You can present all of it & much more using UDC. When you encode a subject like this in UDC you store much more information than your proposed sentence actually contains. So my decision of how to ‘translate this into udc’ would depend on learning more about the actual text and the context of the message it conveys, implied audience/purpose, the field of expertise for which the information in the document may be relevant etc. I would probably wonder whether this is a research report, study, news article, textbook, radio broadcast?

Not knowing more than you said, I can play with the following: 331.46(735.215.2/.4)"2008"

Accidents at work — Washington metropolitan area — year 2008
or a bit more detailed: 331.46-053.18(735.215.2/.4)"2008"
Accidents at work — dead persons — Washington metropolitan area — year 2008
[you can say the number of dead persons but this is not pertinent from point of view of indexing and retrieval]

…or maybe (depending on what is in the content and what the main message of the text is), and because you used the expression ‘fatal injuries’, this may imply that it belongs more to the health and safety/prevention area in health hygiene, which is in medicine.

The UDC structures composed here are:

TIME “2008”

PLACE (735.215.2/.4)  Counties in the Washington metropolitan area

331        Labour. Employment. Work. Labour economics. Organization of labour
331.4      Working environment. Workplace design. Occupational safety. Hygiene at work. Accidents at work
331.46     Accidents at work ==> 614.8

614        Prophylaxis. Public health measures. Preventive treatment
614.8      Accidents. Risks. Hazards. Accident prevention. Personal protection. Safety
614.8.069  Fatal accidents

NB – classification provides a bit more context and is more precise than words when it comes to presenting content. I.e., if the content were focused on health and safety regulation and occupational health, then the choice of numbers and their order would be different, e.g. 614.8.069:331.46-053.18 [relationship between] health & safety policies in the prevention of fatal injuries and accidents at work.

So when you read UDC number 331.46 you do not see only ‘accidents at work’ but rather ==> accidents at work < occupational health/safety < labour economics, labour organization < economy;
and when you see UDC number 614.8 it is not only fatal accidents but rather ==> fatal accidents < accident prevention, safety, hazards < public health and hygiene, accident prevention.

When you see (735.2….) you do not see only Washington, but also United States, North America.
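This “walk up the hierarchy” behaviour can be sketched in a few lines of Python. The caption table below is copied from the examples in this post; the truncation logic is a deliberate simplification for illustration (real UDC notation also carries facets in parentheses and quotes, which this toy walk-up ignores):

```python
# Minimal sketch: expand a UDC main number into its hierarchy chain by
# truncating the notation one character at a time. Captions copied from
# the examples above; everything else is illustrative only.

CAPTIONS = {
    "331": "Labour. Employment. Work. Labour economics. Organization of labour",
    "331.4": "Working environment. Workplace design. Occupational safety. "
             "Hygiene at work. Accidents at work",
    "331.46": "Accidents at work",
    "614": "Prophylaxis. Public health measures. Preventive treatment",
    "614.8": "Accidents. Risks. Hazards. Accident prevention. "
             "Personal protection. Safety",
    "614.8.069": "Fatal accidents",
}

def hierarchy(notation):
    """Return (notation, caption) pairs from the broadest class down."""
    chain = []
    code = notation
    while code:
        if code in CAPTIONS:
            chain.append((code, CAPTIONS[code]))
        # Drop one trailing character, then any trailing '.', to move up.
        code = code[:-1].rstrip(".")
    return list(reversed(chain))

for code, caption in hierarchy("331.46"):
    print(code, "-", caption)
```

So asking for 331.46 yields the whole chain 331 → 331.4 → 331.46, which is exactly the extra context Aida describes: the number carries its ancestry with it.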

So why is this interesting? A couple of reasons…

1. Each of these complex codes combines several different hierarchically organized components; just as they can be used to explore bibliographic materials, similar approaches might be of value for navigating the growing collections of public statistical data. If SKOS is to be extended / improved to better support subject classification structures, we should take care also to consider use cases from the world of statistics and numeric data sharing.

2. Multilingual aspects. There are plans to expose SKOS data for the upper levels of UDC. An HTML interface to this “UDC summary” is already available online, and includes collected translations of textual labels in many languages (see progress report). For example, we can look up 331.4 and find (in hierarchical context) definitions in English (“Working environment. Workplace design. Occupational safety. Hygiene at work. Accidents at work”), alongside e.g. Spanish (“Entorno del trabajo. Diseño del lugar de trabajo. Seguridad laboral. Higiene laboral. Accidentes de trabajo”), Croatian, Armenian, …
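Supposing those translated labels were available as data, re-using them might look something like this. The English and Spanish labels for 331.4 are quoted above; the data layout and function names are invented for illustration:

```python
# Hypothetical layout for multilingual UDC captions, keyed by (notation, lang).
# Only the two labels quoted in this post are real; the structure is a sketch.
LABELS = {
    ("331.4", "en"): "Working environment. Workplace design. Occupational "
                     "safety. Hygiene at work. Accidents at work",
    ("331.4", "es"): "Entorno del trabajo. Diseño del lugar de trabajo. "
                     "Seguridad laboral. Higiene laboral. Accidentes de trabajo",
}

def label(notation, lang, fallback="en"):
    """Return the caption in the requested language, else fall back."""
    return LABELS.get((notation, lang), LABELS.get((notation, fallback)))

print(label("331.4", "es"))
print(label("331.4", "fr"))  # no French label here: falls back to English
```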

Linked Data is about sharing work; if someone else has gone to the trouble of making such translations, it is probably worth exploring ways of re-using them. Numeric data is (in theory) linguistically neutral; this should make linking to translations particularly attractive. Much of the work around RDF and stats is about providing sufficient context to the raw values to help us understand what is really meant by “66” in some particular dataset. By exploiting SDMX-RDF’s use of SKOS, it should be possible to go further and to link out to the wider literature on workplace fatalities. This kind of topical linking should work in both directions: exploring out from numeric data to related research, debate and findings, but also coming in and finding relevant datasets that are cross-referenced from books, articles and working papers. W3C recently launched a Library Linked Data group; I look forward to learning more about how libraries are thinking about connecting numeric and non-numeric information.

Dublin Core “RDF idiom conversion rules” in SPARQL

This note is both an early (and partial) technical review of the new SPARQL RDF query spec and a proposal to the DC-Architecture Working Group for a possible representation language that captures some rules explaining how different forms of DC relate to each other. As a review, I have to say that I now find CONSTRUCT more useful than I originally expected. Maybe a full W3C RDF Rules language would be a useful thing to do after all… :)

The Dublin Core metadata vocabulary is sometimes used as a set of relationships between documents (etc.) and other resources (entities, things), and sometimes as a set of relationships to textual strings associated with those things. The DCMI Abstract Model document gives some more detail on this, and proposes a model to explain DC thinking on this issue. Appendix B of that document describes the mapping of this abstraction into RDF.

One option (todo: find url of Andy’s doc listing options) being explored by the DC-Architecture WG is for Dublin Core to publish explicit guidelines explaining how the two representational idioms can be interconverted. What I try to do below is capture DC community consensus about how the two styles of using properties from the vocabulary compare.

The rules in prose:

Rule 1: Whenever we see a DC property applied to some resource ‘x’, and the value of that property is something that is not a literal (i.e. not a string or markup), then any values of the rdfs:label property which apply to that thing are also values of our DC property. E.g. if a document has a dc:creator whose value is a thing that has an rdfs:label “Andy Powell”, then it is also true that the document has a dc:creator property whose value is the literal string “Andy Powell”.

Rule 2: As above, but reversed. Whenever there is some string that is an rdfs:Literal that is the value of a DC property for some resource ‘x’, then it will also be true that there exists some resource that is not an rdfs:Literal, and that has an rdfs:label whose value is that string, and that ‘x’ stands in the same DC property relationship to that resource.

Problem statement: this sort of thing is (see above) hard to express in prose. We’d like a clean, machine-readable way of expressing these rules, so that test cases can be formulated, mailing-list discussions can be made more precise, and (perhaps) software tools can directly exploit these rules when working with RDF.

The SPARQL query language is a W3C work-in-progress for querying RDF data. It mainly provides mechanisms for specifying bits of RDF to match in some query, but also provides a basic mechanism for CONSTRUCTing further RDF statements based on combining matched data with a template structure. This is a preliminary investigation into the possibility of using SPARQL’s CONSTRUCT mechanism to express the rules being discussed in the DC community. Readers should note that both SPARQL and the DC “idiom conversion rules” are works in progress, and that I may not have a full grasp of either.

Anyway, here goes:

Rule 1 in SPARQL: generate simple form from explicit form

PREFIX dc:   <http://purl.org/dc/elements/1.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
CONSTRUCT   ( ?x ?relation ?string )
WHERE       ( ?x ?relation ?thing )
            ( ?thing rdf:type x:NonLiteralResource )
            ( ?thing rdfs:label ?string )
            ( ?relation rdfs:isDefinedBy <http://purl.org/dc/elements/1.1/> )
# x:NonLiteralResource is a placeholder class standing for "any non-literal resource"

Rule 2 in SPARQL: explicit form from string form (harder)

PREFIX dc:   <http://purl.org/dc/elements/1.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
CONSTRUCT   ( ?x ?relation ?thing )
            ( ?thing rdfs:label ?string )
WHERE       ( ?x ?relation ?string )
            ( ?string rdf:type rdfs:Literal )
            ( ?relation rdfs:isDefinedBy <http://purl.org/dc/elements/1.1/> )
# NB: ?thing is not bound in the WHERE clause; see the note below on unbound variables

See also: comments in SPARQL draft on “what if construct graph includes unbound variables”…
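For comparison, here is a rough procedural sketch of the two rules in Python, over triples held as plain (subject, property, object) tuples. Everything here (the Node class, the abbreviated property strings, the blank-node naming) is invented for illustration; a real implementation would use an RDF toolkit rather than raw tuples:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    """A non-literal resource (URI or blank node), identified by name."""
    name: str

DC_CREATOR = "dc:creator"   # stand-in for the full Dublin Core property URI
RDFS_LABEL = "rdfs:label"   # stand-in for the full RDFS property URI

def rule1(triples, dc_properties):
    """Rule 1: if x --dcprop--> thing and thing --rdfs:label--> label,
    also assert x --dcprop--> label (the simple literal-valued idiom)."""
    inferred = set()
    for (x, prop, thing) in triples:
        if prop in dc_properties and isinstance(thing, Node):
            for (s, p, label) in triples:
                if s == thing and p == RDFS_LABEL and isinstance(label, str):
                    inferred.add((x, prop, label))
    return inferred

def rule2(triples, dc_properties):
    """Rule 2: if x --dcprop--> label (a literal), assert that some resource
    exists with that rdfs:label; here we mint a blank node to stand for it."""
    inferred = set()
    counter = 0
    for (x, prop, label) in triples:
        if prop in dc_properties and isinstance(label, str):
            bnode = Node(f"_:b{counter}")   # freshly minted blank node
            counter += 1
            inferred.add((x, prop, bnode))
            inferred.add((bnode, RDFS_LABEL, label))
    return inferred

g = {
    (Node("doc1"), DC_CREATOR, Node("person1")),
    (Node("person1"), RDFS_LABEL, "Andy Powell"),
}
print(rule1(g, {DC_CREATOR}))  # includes (doc1, dc:creator, "Andy Powell")
```

Note that Rule 2’s freshly minted blank node is the procedural cousin of the unbound-variable issue in the second CONSTRUCT above: the rule asserts the existence of a resource that no existing triple names.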

followup: Dan Connolly and others have noted that I should be careful not to advocate the use of SPARQL as a Rule language. I lost his specific comments due to problems with WordPress, but agree entirely. What I was trying to do here was see how it might look if one did try to express rules using these queries, to better understand how the two technologies relate. Please don’t consider SPARQL a rule language (but please do share your experiences with having RDF Rule, Query and OWL languages work together…).