Skosdex progress: basic Lucene search

I now have a crude Lucene index derived from SKOS data. It is more or less a toy example, but a promising one.

The example below is a test against FAO’s AGROVOC. Each concept becomes a “document”, with a “word” field containing the prefLabel and a “uri” field for the concept URI. I don’t index anything else yet.

The hope here is to have a handy prototyping environment for testing different indexing regimes. The code takes about 4-5 mins to index AGROVOC on my MacBook, running under Jruby.
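
For a flavour of how little code this needs, here is a minimal jruby sketch of what the indexing loop might look like, reusing the jena_skos SKOS wrapper described in the skosdex post below. The dump path and index directory are made up, and the Field flags assume a Lucene 2.x-era API.

require 'java'
require "src/jena_skos"                # the SKOS wrapper described below

java_import org.apache.lucene.analysis.standard.StandardAnalyzer
java_import org.apache.lucene.document.Document
java_import org.apache.lucene.document.Field
java_import org.apache.lucene.index.IndexWriter

s1 = SKOS.new("file:agrovoc_skos.rdf") # hypothetical local AGROVOC dump
writer = IndexWriter.new("lucene_index", StandardAnalyzer.new, true)

s1.concepts.each_pair do |uri, c|
  doc = Document.new
  # "word" holds the prefLabel (analysed for searching); "uri" is stored verbatim
  doc.add(Field.new("word", c.prefLabel, Field::Store::YES, Field::Index::ANALYZED))
  doc.add(Field.new("uri", uri, Field::Store::YES, Field::Index::NOT_ANALYZED))
  # altLabels could be added to "word" (or a field of their own) the same way
  writer.add_document(doc)
end

writer.optimize
writer.close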

The data I’m using is a SKOS dump from the FAO Web site, post-processed with “grep -v” to skip the Farsi lines, due to a Unicode error. The transcript below comes from running Lucli, a handy command line tool for Lucene.

Next steps with indexing? Not sure. Probably make sure altLabel is handled. But I’m also curious about the possibility of including fields that pull in labels from nearby concepts, so they can be matched in weighted searches. It would be hard to evaluate the effectiveness, though.

lucli> search uri:"http://www.fao.org/aims/aos/agrovoc#c_47934"
Searching for: uri:"http www.fao.org aims aos agrovoc c_47934"
1 total matching documents
--------------------------------------
---------------- 1 score:1.0---------------------
word:Pteria hirundo
uri:http://www.fao.org/aims/aos/agrovoc#c_47934
#################################################
lucli> search word:"Leiocottus hirundo"
Searching for: word:"leiocottus hirundo"
1 total matching documents
--------------------------------------
---------------- 1 score:1.0---------------------
word:Leiocottus hirundo
uri:http://www.fao.org/aims/aos/agrovoc#c_45393
#################################################
lucli> search word:"hirundo"
Searching for: word:hirundo
2 total matching documents
--------------------------------------
---------------- 1 score:1.0---------------------
word:Pteria hirundo
uri:http://www.fao.org/aims/aos/agrovoc#c_47934
---------------- 2 score:1.0---------------------
word:Leiocottus hirundo
uri:http://www.fao.org/aims/aos/agrovoc#c_45393
#################################################

Facebook problem statement

People want full ownership and control of their information so they can turn off access to it at any time. At the same time, people also want to be able to bring the information others have shared with them—like email addresses, phone numbers, photos and so on—to other services and grant those services access to those people’s information. These two positions are at odds with each other. There is no system today that enables me to share my email address with you and then simultaneously lets me control who you share it with and also lets you control what services you share it with.
“On Facebook, People Own and Control Their Information”, Mark Zuckerberg, Facebook blog.

Skosdex: SKOS utilities via jruby

I just announced this on the public-esw-thes and public-rdf-ruby lists. I started to make a Ruby API for SKOS.

Example code snippet from the readme.txt (see that link for the corresponding output):

require "src/jena_skos"
s1 = SKOS.new("http://norman.walsh.name/knows/taxonomy")
s1.read("http://www.wasab.dk/morten/blog/archives/author/mortenf/skos.rdf" )
s1.read("file:samples/archives.rdf")
s1.concepts.each_pair do |url,c|
  puts "SKOS: #{url} label: #{c.prefLabel}"
end

c1 = s1.concepts["http://www.ukat.org.uk/thesaurus/concept/1366"] # Agronomy
puts "test concept is #{c1} #{c1.prefLabel}"
c1.narrower do |uri|
  c2 = s1.concepts[uri]
  puts "\tnarrower: #{c2} #{c2.prefLabel}"
  c2.narrower do |narrower_uri|
    c3 = s1.concepts[narrower_uri]
    puts "\t\tnarrower: #{c3} #{c3.prefLabel}"
  end
end

The idea here is to have a lightweight OO API for SKOS, couched in terms of a network of linked “Concepts” with broader and narrower relations, but backed by a full RDF API (in our case Jena, via jruby’s Java integration). Eventually, entire apps could be built at the SKOS API level. For now, anything beyond broader/narrower and prefLabel is hidden away in the RDF (and so you’d need to dip into the Jena API to get to this data).
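
To give a flavour of what that “dipping into Jena” looks like from jruby, here is a rough sketch that goes straight to the RDF for skos:altLabel, which the wrapper doesn’t (yet) expose. It reuses the sample file and concept URI from the snippet above; the rest is ordinary Jena API, assuming jena.jar is on the classpath.

require 'java'
java_import com.hp.hpl.jena.rdf.model.ModelFactory

SKOS_NS = "http://www.w3.org/2004/02/skos/core#"

model = ModelFactory.create_default_model
model.read("file:samples/archives.rdf")

alt_label = model.create_property(SKOS_NS, "altLabel")
concept   = model.get_resource("http://www.ukat.org.uk/thesaurus/concept/1366") # Agronomy

it = concept.list_properties(alt_label)
while it.has_next
  puts "altLabel: #{it.next_statement.object}"
end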

The distinguishing feature is that it uses jruby (an implementation of Ruby in pure Java). As such it can call on the full power of the Jena toolkit, which goes far beyond anything currently available in Ruby. At the moment it doesn’t do much: I just parse SKOS and build a tiny object model that exposes little more than prefLabel and broader/narrower.

I think it’s worth exploring because Ruby is rather nice for scripting, but lacks things like OWL reasoners and the general maturity of Java RDF/OWL tools (parsers, databases, etc.).

If you’re just interested to see how the Jena APIs look when called from jruby, see jena_skos.rb in svn. Excuse the mess.

I’m interested to hear whether anyone else has explored this topic. Obviously there is a lot more to SKOS than broader/narrower, so I’d welcome collaborators, or at least a sanity check, before taking this beyond a rough demo.

Plans – well, my main concern is nothing to do with Java or Ruby; it is to explore Lucene indexing of SKOS data. I am also very interested in the pragmatic question of where SKOS stops and RDFS/OWL starts, and how exactly we bridge that gap. See flickr for my most recent sketch of this landscape, where I revisit the idea of an “it” property (skos:it, foaf:it, …) that links things described in SKOS to “the thing itself”. I hope to load up enough overlapping SKOS data to get some practical experience with the tradeoffs.

The applications I have in mind are query expansion, smarter tagging assistants, and so on. So the next step is probably to try building a Lucene index similar to the contrib/wordnet utility that ships with Java Lucene, which creates an index in which every “document” is really a word from WordNet, with text labels for its synonyms as indexed properties. I also hope to look at the use of SKOS + Lucene for “did you mean?” and auto-completion utilities. It’s also worth noting that Jena ships with LARQ, a Lucene-aware extension to ARQ, Jena’s SPARQL engine.
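
As a very rough sketch of the “did you mean?” / query expansion direction (this is not the wordnet contrib, and not LARQ), something like the following could sit on top of the index from the Lucene post above. It assumes the Lucene 2.x-era Hits API and the “word”/“uri” field names used there.

require 'java'
java_import org.apache.lucene.index.Term
java_import org.apache.lucene.search.TermQuery
java_import org.apache.lucene.search.IndexSearcher

# Return candidate concepts whose prefLabel contains the (lowercased) term.
def suggest(term)
  searcher = IndexSearcher.new("lucene_index")
  hits = searcher.search(TermQuery.new(Term.new("word", term.downcase)))
  suggestions = []
  hits.length.times do |i|
    doc = hits.doc(i)
    suggestions << { :label => doc.get("word"), :uri => doc.get("uri") }
  end
  searcher.close
  suggestions
end

suggest("hirundo").each do |s|
  puts "did you mean #{s[:label]}? (#{s[:uri]})"
end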

Family trees, Gedcom::FOAF in CPAN, and provenance

Ever wondered who the mother(s) of Adam and Eve’s grandchildren were? Me too. But don’t expect SPARQL or the Semantic Web to answer that one! Meanwhile, …

You might nevertheless care to try the Gedcom::FOAF CPAN module from Brian Cassidy. It can read Gedcom, a popular ‘family history’ file format, and turn it into RDF (using FOAF and the relationship and biography vocabularies). A handy tool that can open up a lot of data to SPARQL querying.

The Gedcom::FOAF API seems to focus on turning individual people or family Gedcom entries into their own FOAF XML files. I wrote a quick and horrid Perl script that runs over a Gedcom file and emits a single flattened RDF/XML document instead. It generates URIs for XML files that don’t actually exist, but that isn’t a huge problem.

Perhaps someone would care to take a look at this code and see whether a more RDFa- and linked-data-friendly script would be useful?

Usage: perl gedcom2foafdump.pl BUELL001.GED > _sample_gedfoaf.rdf

The sample data I tested it on is intriguing, though I’ve not really looked around it yet.

It contains over 9800 people including the complete royal lines of England, France, Spain and the partial royal lines of almost all other European countries. It also includes 19 United States Presidents descended from royalty, including Washington, both Roosevelts, Bush, Jefferson, Nixon and others. It also has such famous people as Brigham Young, William Bradford, Napoleon Bonaparte, Winston Churchill, Anne Bradstreet (Dudley), Jesus Christ, Daniel Boone, King Arthur, Jefferson Davis, Brian Boru King of Ireland, and others. It goes all the way back to Adam and Eve and also includes lines to ancient Rome including Constantine the Great and ancient Egypt including King Tutankhamen (Tut).

The data is credited to Matt & Ellie Buell, “Uploaded By: Eochaid”, 1995-05-25.

Here’s an extract to give an idea of the Gedcom form:

0 @I4961@ INDI
1 NAME Adam //
1 SEX M
1 REFN +
1 BIRT
2 DATE ABT 4000 BC
2 PLAC Eden
1 DEAT
2 DATE ABT 3070 BC
1 FAMS @F2398@
1 NOTE He was the first human on Earth.
1 SOUR Genesis 2:20 KJV
0 @I4962@ INDI
1 NAME Eve //
1 SEX F
1 REFN +
1 BIRT
2 DATE ABT 4000 BC
2 PLAC Eden
1 FAMS @F2398@
1 SOUR Genesis 3:20 KJV

It might not directly answer the great questions of biblical scholarship, but it could be a fun dataset to explore Gedcom / RDF mappings with. I wonder how it compares with Freebase, DBpedia etc.
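
For anyone wanting to poke at the raw mapping without installing anything from CPAN, here is a deliberately crude Ruby fragment that picks a couple of fields out of INDI records like those above and emits flat FOAF-ish N-Triples. The example.org URIs and the choice of properties are mine, purely for illustration; this is emphatically not what Gedcom::FOAF itself produces.

FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Handles only the "0 @ID@ INDI" and "1 NAME xxx //" forms seen in the extract above.
def indi_to_ntriples(gedcom_text)
  triples = []
  subject = nil
  gedcom_text.each_line do |line|
    case line
    when /^0 @(\S+)@ INDI/
      subject = "<http://example.org/gedcom/#{$1}>"
      triples << "#{subject} <#{RDF_TYPE}> <#{FOAF}Person> ."
    when /^1 NAME (.+?)\s*\/\//
      triples << "#{subject} <#{FOAF}name> \"#{$1}\" ." if subject
    end
  end
  triples
end

puts indi_to_ntriples(File.read("BUELL001.GED"))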

The Perl module is a good start for experimentation but it only really scratches the surface of the problem of representing source/provenance and uncertainty. On which topic, Jeni Tennison has a post from a year ago that’s well worth (re-)reading.

What I’ve done in the above little Perl script is implement a simplification: instead of each family description being its own separate XML file, they are all squashed into a big flat set of triples (‘graph’). This may or may not be appropriate, depending on the sourcing of the records. It seems Gedcom offers some basic notion of ‘source’, although not one expressed in terms of URIs. If I look in the SOUR(ce) field in the Gedcom file, I see information like this (which currently seems to be ignored in the Gedcom::FOAF mapping):

grep SOUR BUELL001.GED | sort | uniq

1 NOTE !SOURCE:Burford Genealogy, Page 102 Cause of Death; Hemorrage of brain
1 NOTE !SOURCE:Gertrude Miller letter “Harvey Lee lived almost 1 year. He weighed
1 NOTE !SOURCE:Gertrude Miller letter “Lynn died of a ruptured appendix.”
1 NOTE !SOURCE:Gertrude Miller letter “Vivian died of a tubal pregnancy.”
1 SOUR “Castles” Game Manuel by Interplay Productions
1 SOUR “Mayflower Descendants and Their Marriages” pub in 1922 by Bureau of
1 SOUR “Prominent Families of North Jutland” Pub. in Logstor, Denmark. About 1950
1 SOUR /*- TUT
1 SOUR 273
1 SOUR AHamlin777.  E-Mail “Descendents of some guy
1 SOUR Blundell, Sherrie Lea (Slingerland).  information provided on 16 Apr 1995
1 SOUR Blundell, William, Rev. Interview on Jan 29, 1995.
1 SOUR Bogert, Theodore. AOL user “TedLBJ” File uploaded to American Online
1 SOUR Buell, Barbara Jo (Slingerland)
1 SOUR Buell, Beverly Anne (Wenge)
1 SOUR Buell, Beverly Anne (Wenge).  letter addressed to Kim & Barb Buell dated
1 SOUR Buell, Kimberly James.
1 SOUR Buell, Matthew James. written December 19, 1994.
1 SOUR Burnham, Crystal (Harris).  Leter sent to Matt J. Buell on Mar 18, 1995.
1 SOUR Burnham, Crystal Colleen (Harris).  AOL user CBURN1127.  E-mail “Re: [...etc.]

Some of these sources could be tied to cleaner IDs (eg. for books c/o Open Library, although see ‘in search of cultural identifiers’ from Michael Smethurst).

I believe RDF’s SPARQL language gives us a useful tool (the notion of ‘GRAPH’) that can be applied here, but we’re a long way from having worked out the details when it comes to attaching evidence to claims. So for now, we in the RDF scene have a fairly coarse-grained approach to data provenance. Databases are organized into batches of triples, ie. RDF statements that claim something about the world. And while we can use these batches – aka graphs – in our queries, we haven’t really figured out what kind of information we want to associate with them yet. Which is a pity, since this could have uses well beyond family history, for example in online journalistic practices and blog-mediated fact checking.
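
To make the “batches of triples” idea concrete, here is a hedged jruby/Jena sketch: each source’s statements go into their own named graph, and a SPARQL GRAPH pattern then tells us which source asserts which claim. The file names and graph URIs are invented for the example (loosely echoing the SOUR entries above), and the API is roughly Jena/ARQ 2.x; newer versions differ slightly.

require 'java'
java_import com.hp.hpl.jena.rdf.model.ModelFactory
java_import com.hp.hpl.jena.query.DatasetFactory
java_import com.hp.hpl.jena.query.QueryExecutionFactory

# One named graph per source of claims (file names are hypothetical).
sources = {
  "http://example.org/source/gertrude-miller-letter" => "file:miller_letter.rdf",
  "http://example.org/source/mayflower-descendants"  => "file:mayflower.rdf"
}

dataset = DatasetFactory.create
sources.each do |graph_uri, file|
  m = ModelFactory.create_default_model
  m.read(file)
  dataset.add_named_model(graph_uri, m)
end

sparql = <<SPARQL
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?g ?person ?name WHERE {
  GRAPH ?g { ?person foaf:name ?name }
}
SPARQL

results = QueryExecutionFactory.create(sparql, dataset).exec_select
while results.has_next
  soln = results.next_solution
  puts "#{soln.get('g')} says #{soln.get('person')} is named #{soln.get('name')}"
end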

Nearby in the Web: see also the SIOC/SWAN telecons, a collaboration in the W3C SemWeb life sciences community around the topic of modelling scientific discourse.

FRBR and W3C Media Annotations

Just spotted this review of FRBR (Functional Requirements for Bibliographic Records) in the group’s wiki, along with some interesting notes on modelling. It seems the Media Annotations work is starting out well, both in terms of the analysis they’re performing and the relationships they’re seeing (FRBR comes from the library world and is often passed over by industry metadata efforts). Things also look healthy in that they’re working on a public list, with notes kept in a public wiki. This is a much healthier W3C than the one I first encountered in ’97, when everything was somewhat hidden away. Unfortunately, attitudes linger – it is still common to see W3C critiqued for being a closed, secretive body, when the truth is that many WGs operate in a very public, collaborative manner.

Cross-browsing and RDF

Cross-browsing and RDF

While cross-searching has been described and demonstrated through this paper and associated work, the problem of cross-browsing a selection of subject gateways has not been addressed. Many gateway users prefer to browse, rather than search. Though browsing usually takes longer than searching, it can be more thorough, as it is not dependent on the user’s terms matching keywords in resource descriptions (even when a thesaurus is used, it is possible for resources to be “missed” if they are not described in great detail).

As a “quick fix”, a group of gateways may create a higher level menu that points to the various browsable menus amongst the gateways. However, this would not be a truly hierarchical menu system, as some gateways maintain browsable resource menus in the same atomic (or lowest level) subject area. One method of enabling cross-browsing is by the use of RDF.

The World Wide Web Consortium has recently published a preliminary draft specification for the Resource Description Framework (RDF). RDF is intended to provide a common framework for the exchange of machine-understandable information on the Web. The specification provides an abstract model for representing arbitrarily complex statements about networked resources, as well as a concrete XML-based syntax for representing these statements in textual form. RDF relies heavily on the notion of standard vocabularies, and work is in progress on a ‘schema’ mechanism that will allow user communities to express their own vocabularies and classification schemes within the RDF model.

RDF’s main contribution may be in the area of cross-browsing rather than cross-searching, which is the focus of the CIP. RDF promises to deliver a much-needed standard mechanism that will support cross-service browsing of highly-organised resources. There are many networked services available which have classified their resources using formal systems like MeSH or UDC. If these services were to each make an RDF description of their collection available, it would be possible to build hierarchical ‘views’ of the distributed services offering a user interface organised by subject-classification rather than by physical location of the resource.

From Cross-Searching Subject Gateways, The Query Routing and Forward Knowledge Approach, Kirriemuir et al., D-Lib Magazine, January 1998.

I wrote this over 11 (eleven) years ago, as something of an aside during a larger paper on metadata for distributed search. While we are making progress towards such goals, especially with regard to cross-referenced descriptions of identifiable things (ie. the advances made through linked data techniques lately), the pace of progress can be quite frustrating. Just as it seems like we’re making progress, things take a step backwards. For example, the wonderful lcsh.info site is currently offline while the relevant teams at the Library of Congress figure out how best to proceed. It’s also ten years since Charlotte Jenkins published some great work on auto-classification that used OCLC’s Dewey Decimal Classification. That work also ran into problems, since DDC wasn’t freely available for use in such applications. In the current climate, with Creative Commons, open source, Web 2.0 and suchlike all the rage, I hope we’ll finally see more thesaurus and classification systems opened up (eg. with SKOS) and fully linked into the Web. Maybe by 2019 the Web really will be properly cross-referenced…

OpenID – a clash of expectations?

Via Dan Connolly, this from the mod_auth_openid FAQ:

Q: Is it possible to limit login to some users, like htaccess/htpasswd does?

A: No. It is possible to limit authentication to certain identity providers (by using AuthOpenIDDistrusted and AuthOpenIDTrusted, see the main page for more info). If you want to restrict to specific users that span multiple identity providers, then OpenID probably isn’t the authentication method you want. Note that you can always do whatever vetting you want using the REMOTE_USER CGI environment variable after a user authenticates.

Funny, this is just what I thought was most interesting about OpenID: it lets you build sites that offer varying experiences to different users (including letting them in or not) based on what you know about them. OpenID itself doesn’t do everything out of the box, but by associating public URIs with people, it’s a very useful step.
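
For what it’s worth, the FAQ’s closing point already gets you quite far: mod_auth_openid puts the authenticated OpenID URL in REMOTE_USER, so even a plain CGI script can vary what it offers per identity. A toy Ruby sketch, with an obviously made-up allowed list:

#!/usr/bin/env ruby
require 'cgi'

# OpenID URLs we want to let in; everyone else still authenticates, but gets turned away.
ALLOWED = [
  "http://danbri.org/",
  "http://simonwillison.net/"
]

cgi    = CGI.new
openid = ENV['REMOTE_USER']   # set by mod_auth_openid after authentication

if ALLOWED.include?(openid)
  cgi.out("text/html") { "<p>Welcome, #{CGI.escapeHTML(openid.to_s)}.</p>" }
else
  cgi.out("status" => "FORBIDDEN", "type" => "text/html") { "<p>Sorry, members only.</p>" }
end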

A year ago I sketched a scenario in this vein (and it seems to have survived a sanity check from Simon Willison, or at least he quotes it). It seems perhaps that OpenID is all things to all people…?

SKOS deployment stats from Sindice

This cropped up in yesterday’s W3C Semantic Web Coordination Group telecon, as we discussed the various measures of SKOS deployment success.

I suggested drawing a distinction between the use of SKOS to publish thesauri (ie. SKOS concept schemes), and the use of SKOS in RDFS/OWL schemas, for example subclassing skos:Concept or defining properties whose range or domain is skos:Concept. A full treatment would look for a variety of constructs (eg. new properties that declare themselves subPropertyOf something in SKOS).

An example of such a use of SKOS is the new sioc:Category class, recently added to the SIOC namespace.
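
(For local experiments, the same question can be asked of any schema you can load into Jena. Here is a rough jruby/ARQ sketch that looks for rdfs:subClassOf, rdfs:domain and rdfs:range statements pointing at skos:Concept; the SIOC namespace URL is the real one, but otherwise treat this as illustrative.)

require 'java'
java_import com.hp.hpl.jena.rdf.model.ModelFactory
java_import com.hp.hpl.jena.query.QueryExecutionFactory

model = ModelFactory.create_default_model
model.read("http://rdfs.org/sioc/ns#")   # the SIOC schema, which defines sioc:Category

sparql = <<SPARQL
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?term WHERE {
  { ?term rdfs:subClassOf skos:Concept }
  UNION { ?term rdfs:domain skos:Concept }
  UNION { ?term rdfs:range skos:Concept }
}
SPARQL

results = QueryExecutionFactory.create(sparql, model).exec_select
while results.has_next
  puts results.next_solution.get("term")
end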

Here are some quick experiments with Sindice.

Search results for advanced “* <http://www.w3.org/2000/01/rdf-schema#domain> <http://www.w3.org/2004/02/skos/core#Concept>”, found 10

Search results for advanced “* <http://www.w3.org/2000/01/rdf-schema#range> <http://www.w3.org/2004/02/skos/core#Concept>”, found 10

Search results for advanced “* <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://www.w3.org/2004/02/skos/core#Concept>”, found 18

Here’s a query that finds all mentions of skos:Concept in an object role within an RDF statement:

Search results for advanced “* * <http://www.w3.org/2004/02/skos/core#Concept>”, found about 432.32 thousand

This all seems quite healthy, although I’ve not clicked through to explore many of these results yet.

BTW I also tried using the proposed (but retracted – see CR request notes) new SKOS namespace, http://www.w3.org/2008/05/skos#Concept (unless I’m mistaken). I couldn’t find any data in Sindice yet that was using this namespace.

Semantic Web – Use it or lose it

ESWC2009 Semantic Web In Use track

While papers submitted to the scientific track may provide evidence of scientific contribution through applications and evaluations (see 4. and 5. of the conference Topics of Interest), papers submitted to the Semantic Web In Use Track should be organised around some of or all of the following aspects:

  • Description of concrete problems in specific application domains, for which Semantic Web technologies can provide a solution.
  • Description of an implemented application of Semantic Web technologies in a specific domain
  • Assessment of the pros and cons of using Semantic Web technologies to solve a particular business problem in a specific domain
  • Comparison with alternative or competing approaches using conventional or competing technologies
  • Assessment of the costs and benefits of the application of Semantic Web Technologies, e.g. time spent on implementation and deployment, efforts involved, user acceptance, returns on investment
  • Evidence of deployment of the application, and assessment/evaluation of usage/uptake.

One thing I would encourage here (in the tradition of the Journal of Negative Results) is that people remember that negative experience is still experience. While the SemWeb technology stack has much to recommend it, there are also many circumstances where it isn’t quite the right fit, or where alternative SemWeb approaches (GRDDL, SQL2SPARQL, …) can bring similar advantages at lower cost. I would like to see some thoughtful and painfully honest writeups of cases where Semantic Web technologies haven’t quite worked out as planned. Technology projects fail all the time; there’s nothing to be ashamed of. But when it’s a technology project that uses standards and tools I’ve contributed to, I really want to know more about what went wrong, if anything went wrong. ESWC2009 seems a fine place to share such experiences and to learn how to better use these technologies…