Subject classification and Statistics

Subject classification and statistics share some common problems. This post takes a small example discussed at this week’s ODaF event on “Semantic Statistics” in Tilburg, and explores how it can be expressed in the Universal Decimal Classification (UDC). UDC supports faceted description, providing an abstract grammar that allows sentence-like subject descriptions to be composed from the “raw materials” defined in its vocabulary scheme.

This makes the mapping of UDC (and to some extent also Dewey classifications) into W3C’s SKOS somewhat lossy, since patterns and conventions for documenting these complex, composed structures are not yet well established. In the NoTube project we are looking into this in a TV context, in large part because the BBC archives make extensive use of UDC via their Lonclass scheme; see my ‘investigating Lonclass’ and UDC seminar talk for more on those scenarios. Until this week I hadn’t thought enough about the potential for using this to link deep into statistical datasets.

One of the examples discussed on Tuesday was as follows (via Richard Cyganiak):

“There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008”

There was much interesting discussion in Tilburg about the proper scope and role of Linked Data techniques for sharing this kind of statistical data. Do we use RDF essentially as metadata, to find ‘black boxes’ full of stats, or do we use RDF to try to capture something of what the statistics are telling us about the world? When do we use RDF as simple factual data directly about the world (e.g. school X has N pupils [currently; or at time t]), and when does it become a carrier for raw numeric data whose meaning is not so directly expressed at the factual level?

The state of the art in applying RDF here seems to be SDMX-RDF; see Richard’s slides. The SDMX-RDF work uses SKOS to capture code lists, to describe cross-domain concepts and to indicate subject matter.

Given all this, I thought it would be worth taking this tiny example and looking at how it might look in UDC, both as an example of the ‘compositional semantics’ some of us hope to capture in extended SKOS descriptions, and as a way to explore scenarios that cross-link numeric data with the bibliographic materials that can be found via library classification techniques such as UDC. So I asked the ever-helpful Aida Slavic (editor in chief of the UDC), who talked me through how this example data item looks from a UDC perspective.

I asked,

So I’ve just got home from a meeting on semweb/stats. These folk encode data values with stuff like “There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008”. How much of that could have a UDC coding? I guess I should ask, how would you subject index a book whose main topic was “occupational injuries in the Washington DC metro area in 2008”?

Aida’s reply (posted with permission):

You can present all of it & much more using UDC. When you encode a subject like this in UDC you store much more information than your proposed sentence actually contains. So my decision of how to ‘translate this into udc’ would depend on learning more about the actual text and the context of the message it conveys, implied audience/purpose, the field of expertise for which the information in the document may be relevant etc. I would probably wonder whether this is a research report, study, news article, textbook, radio broadcast?

Not knowing more than you said I can play with the following: 331.46(735.215.2/.4)“2008”

Accidents at work — Washington metropolitan area — year 2008
or a bit more detailed: 331.46-053.18(735.215.2/.4)“2008”
Accidents at work — dead persons – Washington metropolitan area — year 2008
[you can say the number of dead persons but this is not pertinent from point of view of indexing and retrieval]

…or maybe (depending what is in the content and what is the main message of the text) and because you used the expression ‘fatal injuries’ this may imply that this is more health and safety/ prevention area in health hygiene which is in medicine.

The UDC structures composed here are:

TIME “2008”

PLACE (735.215.2/.4)  Counties in the Washington metropolitan area

TOPIC 1
331     Labour. Employment. Work. Labour economics. Organization of  labour
331.4     Working environment. Workplace design. Occupational safety.  Hygiene at work. Accidents at work
331.46  Accidents at work ==> 614.8

TOPIC 2
614   Prophylaxis. Public health measures. Preventive treatment
614.8    Accidents. Risks. Hazards. Accident prevention. Personal protection. Safety
614.8.069    Fatal accidents

NB – classification provides a bit more context and is more precise than words when it comes to presenting content, i.e. if the content is focused on health and safety regulation and occupational health then the choice of numbers and their order would be different, e.g. 614.8.069:331.46-053.18 [relationship between] health & safety policies in prevention of fatal injuries and accidents at work.

So when you read UDC number 331.46 you do not see only e.g. ‘accidents at work’ but ==> ‘accidents at work < occupational health/safety < labour economics, labour organization < economy’
and when you see UDC number 614.8.069 it is not only ‘fatal accidents’ but rather ==> ‘fatal accidents < accident prevention, safety, hazards < Public health and hygiene. Accident prevention’

When you see (735.2….) you do not only see Washington but also United States, North America
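As a rough illustration of how these composed codes break down, here is a minimal Python sketch. It is not part of UDC or of Aida's reply: the grammar it accepts is a deliberately simplified assumption (a main number, then an optional hyphen auxiliary, an optional place in parentheses and an optional time in double quotes), and the names `Facets`, `parse_udc` and `broader_chain` are made up for this post. The broader chain is derived by trimming digits and treating dots as punctuation after every third digit, which holds for simple main numbers such as 331.46 but not for special auxiliaries such as the .069 in 614.8.069.

```python
import re
from typing import List, NamedTuple, Optional


class Facets(NamedTuple):
    topic: str                 # main number, e.g. "331.46"
    auxiliary: Optional[str]   # hyphen auxiliary, e.g. "-053.18" (dead persons)
    place: Optional[str]       # content of (...), e.g. "735.215.2/.4"
    time: Optional[str]        # content of "...", e.g. "2008"


# Assumed, simplified grammar: topic, then optional -auxiliary, (place), "time".
# Real UDC syntax is much richer (colon relations, special auxiliaries, etc.).
_PATTERN = re.compile(
    r'^(?P<topic>\d[\d.]*)'
    r'(?P<auxiliary>-\d[\d.]*)?'
    r'(?:\((?P<place>[^)]*)\))?'
    r'(?:"(?P<time>[^"]*)")?$'
)

# Captions copied from the post; a real system would load the UDC schedules.
CAPTIONS = {
    "331": "Labour. Employment. Work. Labour economics. Organization of labour",
    "331.4": "Working environment. Workplace design. Occupational safety. "
             "Hygiene at work. Accidents at work",
    "331.46": "Accidents at work",
}


def parse_udc(notation: str) -> Facets:
    """Split a composed code into facets, accepting straight or curly quotes."""
    text = notation.strip().replace("\u201c", '"').replace("\u201d", '"')
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not covered by this simplified grammar: {notation!r}")
    return Facets(**match.groupdict())


def broader_chain(main_number: str) -> List[str]:
    """Return the number followed by its ancestors, most specific first."""
    digits = main_number.replace(".", "")
    chain = []
    for length in range(len(digits), 0, -1):
        prefix = digits[:length]
        groups = [prefix[i:i + 3] for i in range(0, len(prefix), 3)]
        chain.append(".".join(groups))
    return chain


if __name__ == "__main__":
    facets = parse_udc('331.46-053.18(735.215.2/.4)"2008"')
    print(facets)
    for number in broader_chain(facets.topic):
        print(number, "-", CAPTIONS.get(number, "(caption not listed in the post)"))
```

Codes that use the colon relation, such as 614.8.069:331.46-053.18, are outside this toy grammar and raise a `ValueError`; handling them properly would need a fuller model of UDC syntax.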

So why is this interesting? A couple of reasons…

1. Each of these complex codes combines several different hierarchically organized components; just as they can be used to explore bibliographic materials, similar approaches might be of value for navigating the growing collections of public statistical data. If SKOS is to be extended / improved to better support subject classification structures, we should take care also to consider use cases from the world of statistics and numeric data sharing.

2. Multilingual aspects. There are plans to expose SKOS data for the upper levels of UDC. An HTML interface to this “UDC summary” is already available online, and includes collected translations of textual labels in many languages (see progress report). For example, we can look up 331.4 and find (in hierarchical context) definitions in English (“Working environment. Workplace design. Occupational safety. Hygiene at work. Accidents at work”), alongside e.g. Spanish (“Entorno del trabajo. Diseño del lugar de trabajo. Seguridad laboral. Higiene laboral. Accidentes de trabajo”), Croatian, Armenian, …
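To give a feel for what such multilingual SKOS data might look like, the sketch below writes out a Turtle description of 331.4 using the English and Spanish labels quoted above. The `http://example.org/udc/` base URI and the function names are placeholders invented for this illustration; they are not the identifiers the UDC Consortium uses or plans to use. Only the SKOS namespace and the `skos:Concept`, `skos:notation`, `skos:broader` and `skos:prefLabel` terms are real.

```python
from typing import Dict, Optional

SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."
BASE = "http://example.org/udc/"  # hypothetical base URI, for illustration only

# Labels for 331.4 as quoted in the post.
LABELS_331_4 = {
    "en": "Working environment. Workplace design. Occupational safety. "
          "Hygiene at work. Accidents at work",
    "es": "Entorno del trabajo. Diseño del lugar de trabajo. Seguridad laboral. "
          "Higiene laboral. Accidentes de trabajo",
}


def _escape(text: str) -> str:
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def concept_to_turtle(notation: str, labels: Dict[str, str],
                      broader: Optional[str] = None) -> str:
    """Serialise one concept with language-tagged labels as Turtle text."""
    lines = [
        f"<{BASE}{notation}> a skos:Concept ;",
        f'    skos:notation "{notation}" ;',
    ]
    if broader is not None:
        lines.append(f"    skos:broader <{BASE}{broader}> ;")
    for lang in sorted(labels):
        lines.append(f'    skos:prefLabel "{_escape(labels[lang])}"@{lang} ;')
    # Replace the trailing " ;" of the last statement with the closing " ."
    lines[-1] = lines[-1][:-2] + " ."
    return SKOS_PREFIX + "\n\n" + "\n".join(lines) + "\n"


def pick_label(labels: Dict[str, str], lang: str,
               fallback: str = "en") -> Optional[str]:
    """Return the label in the requested language, else the fallback language."""
    return labels.get(lang) or labels.get(fallback)


if __name__ == "__main__":
    print(concept_to_turtle("331.4", LABELS_331_4, broader="331"))
    print(pick_label(LABELS_331_4, "es"))
```

A statistics site could use something like `pick_label` to show a dimension or subject heading in the reader's language while the numbers themselves stay untouched.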

Linked Data is about sharing work; if someone else has gone to the trouble of making such translations, it is probably worth exploring ways of re-using them. Numeric data is (in theory) linguistically neutral; this should make linking to translations particularly attractive. Much of the work around RDF and stats is about providing sufficient context to the raw values to help us understand what is really meant by “66” in some particular dataset. By exploiting SDMX-RDF’s use of SKOS, it should be possible to go further and to link out to the wider literature on workplace fatalities. This kind of topical linking should work in both directions: exploring out from numeric data to related research, debate and findings, but also coming in and finding relevant datasets that are cross-referenced from books, articles and working papers. W3C recently launched a Library Linked Data group; I look forward to learning more about how libraries are thinking about connecting numeric and non-numeric information.
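The two-way linking can be sketched very simply if datasets and documents are tagged with UDC main numbers. The following is only a toy, under stated assumptions: two of the three resources are invented placeholders, the matching rule (one subject's digits are a prefix of the other's) is a crude stand-in for real hierarchy data, and the only cross-reference followed is the 331.46 ==> 614.8 pointer quoted earlier in this post.

```python
from typing import List, NamedTuple


class Resource(NamedTuple):
    title: str
    kind: str      # "dataset" or "document"
    subject: str   # UDC main number


RESOURCES = [
    # The statistical example from the post.
    Resource("Fatal occupational injuries, Washington DC metro area, 2008",
             "dataset", "331.46"),
    # Hypothetical bibliographic records, invented for illustration.
    Resource("A hypothetical report on workplace safety", "document", "331.4"),
    Resource("A hypothetical study of accident prevention", "document", "614.8"),
]

# Cross-reference listed in the post: 331.46 ==> 614.8
SEE_ALSO = {"331.46": ["614.8"]}


def _digits(number: str) -> str:
    return number.replace(".", "")


def overlaps(a: str, b: str) -> bool:
    """True if one number equals, or is broader than, the other."""
    da, db = _digits(a), _digits(b)
    return da.startswith(db) or db.startswith(da)


def related(resource: Resource, pool: List[Resource], kind: str) -> List[Resource]:
    """Find resources of the given kind that share a subject branch."""
    keys = [resource.subject] + SEE_ALSO.get(resource.subject, [])
    return [
        other for other in pool
        if other != resource
        and other.kind == kind
        and any(overlaps(key, other.subject) for key in keys)
    ]


if __name__ == "__main__":
    dataset = RESOURCES[0]
    # Exploring out: from the numbers to the literature.
    for doc in related(dataset, RESOURCES, "document"):
        print("document:", doc.title)
    # Coming in: from a document to relevant datasets.
    for data in related(RESOURCES[1], RESOURCES, "dataset"):
        print("dataset:", data.title)
```

Note that the cross-reference is followed in one direction only here, so the 614.8 document does not lead back to the dataset; whether such pointers should be treated as symmetric is exactly the kind of convention that would need to be agreed when publishing the data.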