in coding

Skosdex progress: basic lucene search

I now have a crude Lucene index derrived from SKOS data. It is more or less a toy example, but somehow promising also.

Example below is a test against FAO‘s AGROVOC. Each concept becomes a “document”, with a “word” field containing the prefLabel, and a “uri” field for the concept URI. I don’t index anything else yet.

The hope here is to have a handy prototyping environment for testing different indexing regimes. The code takes about 4-5 mins to index AGROVOC on my MacBook, running under Jruby.

The data I’m using is a SKOS dump from the FAO Web site, post-processed with “grep -v” to skip the Farsi lines, due to a Unicode error. The transcript below comes from running Lucli, a handy command line tool for Lucene.

Next steps with indexing? Not sure. Probably make sure altLabel is handled. But I’m also curious about possibility of including fields that pull in labels from nearby concepts, so they can be matched in weighted searches. Would be hard to evaluate the effectiveness though.

lucli> search uri:”http://www.fao.org/aims/aos/agrovoc#c_47934″
Searching for: uri:”http www.fao.org aims aos agrovoc c_47934″
1 total matching documents
————————————–
—————- 1 score:1.0———————
word:Pteria hirundo
uri:http://www.fao.org/aims/aos/agrovoc#c_47934
#################################################
lucli> search word:”Leiocottus hirundo”
Searching for: word:”leiocottus hirundo”
1 total matching documents
————————————–
—————- 1 score:1.0———————
word:Leiocottus hirundo
uri:http://www.fao.org/aims/aos/agrovoc#c_45393
#################################################
lucli> search word:”hirundo”
Searching for: word:hirundo
2 total matching documents
————————————–
—————- 1 score:1.0———————
word:Pteria hirundo
uri:http://www.fao.org/aims/aos/agrovoc#c_47934
—————- 2 score:1.0———————
word:Leiocottus hirundo
uri:http://www.fao.org/aims/aos/agrovoc#c_45393
#################################################

Add Comment Register