Quick clarification on SPARQL extensions and “Lock-in”

It’s clear from discussion bouncing around IRC, Twitter, Skype and elsewhere that “Lock-in” isn’t a phrase to use lightly.

So I post this to make myself absolutely clear. A few days ago I mentioned in IRC a concern that newcomers to SPARQL and RDF databases might not appreciate which SPARQL extensions are widely implemented, and which are the specialist offerings of the system they happen to be using. I mentioned OpenLink’s Virtuoso in particular as a SPARQL implementation that had a rich and powerful set of extensions.

Since it seems there is some risk I might be misinterpreted as suggesting OpenLink are actively trying to “do a Microsoft” and trap users in some proprietary pseudo-SPARQL, I’ll state what I took to be obvious background knowledge: OpenLink is a company that owes its success to the promotion of cross-vendor database portability; they have been tireless advocates of a standards-based Semantic Web, and they are active in proposing extensions to W3C for standardisation. So – no criticism of OpenLink intended. None at all.

All I think we need here are a few utilities that help developers understand the nature of the various SPARQL dialects and the potential costs and benefits of using them. Perhaps an online validator, alongside those for RDF/XML, RDFa, Turtle etc. Such a validator might usefully list the extensions used in a query, and give pointers (perhaps into a wiki) where the status of the various extension constructs can be discussed and documented.
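Even a crude keyword scan would be a start. Here is a toy sketch of the idea (the pattern list is purely illustrative, and certainly not complete or authoritative; a real tool would keep this list somewhere editable, such as a wiki):

# Toy sketch: flag constructs in a SPARQL query that go beyond the SPARQL 1.0 REC.
# The patterns below are illustrative only, not an authoritative catalogue.
EXTENSIONS = {
  /\bLET\s*\(/i => "variable assignment (LET), an ARQ extension",
  /\bbif:\w+/ => "Virtuoso built-in function (bif:)",
  /\bCOUNT\s*\(/i => "aggregate function; not in SPARQL 1.0, proposed for standardisation"
}

def report_extensions(query)
  EXTENSIONS.each do |pattern, description|
    puts "uses extension: #{description}" if query =~ pattern
  end
end

report_extensions("SELECT COUNT(?s) WHERE { ?s ?p ?o }")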

Since SPARQL is such a young language, it lacks a lot of things that are taken for granted in the SQL world, and so using rich custom extensions, where available, is a sensible choice for many developers. My only concern is that it must be a choice, and one entered into consciously.

Getting started with Mozilla Jetpack for Thunderbird (on OSX)

A few weeks ago, I started to experiment with Mozilla’s new Jetpack extension model when it became available for Thunderbird. Revisiting the idea today, I realise I’d forgotten the basic setup details, so am recording them here for future reference.

I found I had to download the source from the Mercurial Web interface, rather than use pre-prepared XPI installers. This may have improved by the time you read this. I also learned (from Standard9 in #jetpack IRC) that I need asuth’s repository, rather than the main one. Again, things move quickly, don’t assume this is true forever.

Here is what worked for me, on OSX.

1. Grab a .zip from the Jetpack repo, and unpack it locally on a machine that has Thunderbird installed.

2. Edit extensions/install.rdf and make sure the em:maxVersion in the Thunderbird section matches your version of Thunderbird. In mine I updated it to say <em:maxVersion>3.0b4</em:maxVersion> (instead of 3.0b4pre).

3. See the README in the jetpack filetree for installation instructions. With Thunderbird closed, I ran “python manage.py install --app=thunderbird” and found Jetpack installed fine.

4. Run Thunderbird; you should see an about:jetpack tab, and corresponding options in the Tools menu.

This was enough to get started. See discussion on visophyte.org for some example code.

After installation, you can use the about:jetpack windows to load, reload and delete Jetpacks from URL.

So, why would you bother doing all this? Jetpack provides a simple way of extending an email client using Web technology.

In my current (unfinished!) experiment, for example, I’m looking at making a sidebar that shows information (photo, blog etc.) about the sender of the currently-viewed email. I figured that if I blogged this HOWTO, someone more familiar with Ajax, jQuery etc. might care to help with wiring this up to the Google Social Graph JSON API, so we can use FOAF and XFN to provide more contextual information around incoming mail…
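To make that concrete, here is a very rough sketch of the lookup side, in Ruby rather than the Jetpack’s JavaScript, just to show the data flow. The endpoint and q parameter are as I understand the Social Graph API; double-check them against its documentation before relying on this.

# Rough sketch: ask the Google Social Graph API what it knows about a URL
# (e.g. a homepage associated with the sender of the current email).
# The endpoint and parameters here are assumptions to be checked against the API docs.
require 'net/http'

def social_graph_lookup(url)
  Net::HTTP.get(URI.parse("http://socialgraph.apis.google.com/lookup?q=#{URI.escape(url)}&pretty=1"))
end

puts social_graph_lookup("http://danbri.org/")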

(All of the above assumes you are running Thunderbird 3.0b4.)

Mirrors and Prisms: robust site-specific browsers

Mozilla (amongst others, see Chris Messina’s writeup of the trend, also Matt’s) have been exploring site-specific browsers through their Prism project. These combine aspects of the Web and Desktop environments, allowing you to have a desktop app tuned for browsing just one specific Web site. Prism is an application which, when run, will generate new per-site desktop applications. Currently it does not yet have a fancy packaging/installer, so users will need to install Prism plus the site files separately.

I have started to look at Prism as a basis for accessing robust, mirrored sites, so that a single point of failure (or censorship) might be avoided. With a lot of help from Matt and others in #prism IRC chat, I have something almost working. The idea is simple: hack Prism so that the running browser code intercepts clicks and (based on some as-yet-undefined logic and preferences) gets the page from a list of mirrors, which might also be fetched dynamically from the ’net.

I should also mention that one motivation here is anti-censorship tooling: giving users an easy way to access sites which might otherwise be blocked by IP address or URL. I looked at FoxyProxy as an option, but for site-specific robustness, running a full proxy server seems a bit heavy compared to simply duplicating a set of files. Here’s what the main Prism app looks like:

Screenshot showing Prism config settings for a site-specific browser.

Once you have Prism installed, you can hack a file named webrunner.js to intervene when links are clicked. In OSX, this can be found as /Applications/Prism.app/Contents/Resources/chrome/webrunner/content/webrunner.js.

Edit this: _domActivate : function(aEvent)

I added the following block to the start of this function:

// Intercept clicks on in-site links and rewrite them to a mirror URL.
var link = aEvent.target;
if (link instanceof HTMLAnchorElement && !WebRunner._isLinkExternal(link)) {
  aEvent.preventDefault();
  // The prefix is a placeholder; real code would pick a mirror from a fetched list.
  WebRunner._getBrowser().loadURI("http://example.org/mirrors/" + link.href, null, null);
}

The idea here being that we intercept clicks, and rewrite them to point to equivalent http:// URIs elsewhere on the Web. As far as this goes, it works as advertised. But what I have is far from working… it would need some code in there to find the right mirror URLs to fetch from; perhaps a list might be fetched on startup, or the first time a link is followed (see the sketch below). It could also do with some work on packaging, so that this hacked version of Prism plus some actual site-specific browser config can be made into an easy-install Windows .exe or OSX .app. For a Windows installer, I am told that NSIS is a good place to start. You could also imagine a version that hid the mirrored URLs from the user’s view. Since Prism has a built-in option to completely hide the URL navigation bar, I haven’t investigated this idea yet.
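The missing mirror-selection piece is roughly the following (sketched in Ruby just to pin the logic down; in Prism itself it would be JavaScript inside webrunner.js, and the mirror-list URL is a placeholder):

# Sketch of mirror selection: fetch a list of mirror base URLs once,
# then rewrite a clicked link against the first mirror that responds.
require 'net/http'

MIRROR_LIST_URL = "http://example.org/mirrors.txt" # placeholder

def mirrors
  @mirrors ||= Net::HTTP.get(URI.parse(MIRROR_LIST_URL)).split("\n")
end

def mirrored_url(original_url)
  path = URI.parse(original_url).path
  mirrors.each do |base|
    candidate = base + path
    return candidate if Net::HTTP.get_response(URI.parse(candidate)).is_a?(Net::HTTPSuccess)
  end
  original_url # fall back to the original if no mirror responds
end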

OK I think I’ve written up everything I learned from the helpful folks in IRC. I hope this repays some karma. If anyone cares to explore this further, or wants to help target student projects on exploring it, please get in touch.

Rick Jelliffe on XML Schema

From the TAG list:

XML Schemas is like using a Swiss Army knife to cook with. Most Asian kitchens get by with a handful of simple tools: chopsticks, hatchet, a good knife, perhaps even a spoon. But the logic of the XSD WG is “Oh, the French need to make quenelles, we must have a quenelling spoon as a grave matter of Internationalization because it is not our business to judge what people need… as long as it is more stuff.” So XSD 1.1 welds another Swiss Army knife onto the existing one, so that no kitchen should suffer without a quenelling spoon.

See also earlier comments on the Schema Experience Workshop from W3C.

So tool-makers blame users for generating non-standard schemas, and users blame the spec for being too difficult to know whether their schemas are standard or not, and spec makers blame tool makers for not implementing the spec properly. Who will free us from this cycle of sin and death?

[...] The only way that XML Schemas can be refactored is with a different core XML Schemas working group. My current expectation is that a lot of nothing will happen until XQuery/XSLT2 becomes seen as a more central technology than XML Schemas; the goal will then be how to support XQuery most minimally.

XSD doesn’t trouble me as much as it troubles Rick, but I have long sympathised with the approach he advocates with Schematron. The RDF equivalent of this is the approach Libby and I called “Schemarama”: expressing constraints against RDF instance data using queries. See the original 2001 demo using SquishQL, and a later reworking by Alistair Miles using SPARQL (currently offline?). Recent work from the OWL experts at Clark & Parsia (blog post; another blog post) is heading in the same direction. I wonder whether Rick’s observation about XML applies to RDF too, and whether, at some point, SPARQL query facilities will be so ubiquitous in RDF tools that it becomes second nature to apply them to data-checking tasks too…?
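For a flavour of the query-based approach: a constraint like “every foaf:Person should have a foaf:name” becomes a query that returns the violators. A rough jruby/Jena ARQ sketch (the input file name is just an example, and this is only one way of phrasing such a check):

# Schemarama-style checking: a SPARQL query that finds instance data violating
# a constraint (here, foaf:Person resources with no foaf:name).
# Assumes the Jena + ARQ jars are on the classpath; the input file is an example.
require 'java'

model = com.hp.hpl.jena.rdf.model.ModelFactory.createDefaultModel
model.read("file:samples/people.rdf")

check = <<SPARQL
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person WHERE {
  ?person a foaf:Person .
  OPTIONAL { ?person foaf:name ?name }
  FILTER (!bound(?name))
}
SPARQL

results = com.hp.hpl.jena.query.QueryExecutionFactory.create(check, model).execSelect
while results.hasNext
  puts "missing foaf:name: #{results.nextSolution.get('person')}"
end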

Update: see also SpinRDF from Holger & co. at Top Quadrant

Site recovery

Busy sysadmin week. The main FOAF site is back, now hosted on Amazon EC2. Thanks to Stephane Corlosquet for all the time he spent fixing up the Drupal installation, after the recent server compromise. I’ve also moved over danbri.org (well, DNS is propagating), and migrated my blog into a completely fresh WordPress installation. The FOAF namespace site and Subversion server are safe, and not yet migrated to new hosting. Various documents from danbri.org are still offline while I scrub all the HTML, .js, .php etc for mischief. The old rdfweb.org site is also offline. I’d rather move slowly and carefully than mess up this process.

This is a test post from the new WordPress to see if it works. Note that I’ve stripped all plugins and addons and will be much more conservative with trying extensions in the future. In particular, OpenID-based commenting isn’t working right now, but it’s on the todo list. One of the most disconcerting things about being hacked is when the site is also your OpenID. I’m wondering how to better partition things in the future; perhaps using id.danbri.org might give some more options?

Flickr & MusicBrainz Machine tags: If you’ve got it, flaunt it

From Sander van Zoest at Uncensored Interview, a convention for representing MusicBrainz identifiers using Flickr’s Machine Tag mechanism.

Example:

A photo of Matthew Dear, tagged as follows:

It also includes a Wikipedia identifier, which could be used to link to DBpedia (though this might duplicate information also available within MusicBrainz’s advanced relationships system). There must be many thousands of artist photos on Flickr; perhaps we’ll see tools to improve their tagging so they can be re-used more easily…

Nearby: Matthew Dear in DBpedia(RDF), in Freebase (RDF), …

Skosdex progress: basic lucene search

I now have a crude Lucene index derived from SKOS data. It is more or less a toy example, but promising nonetheless.

The example below is a test against FAO’s AGROVOC. Each concept becomes a “document”, with a “word” field containing the prefLabel, and a “uri” field for the concept URI. I don’t index anything else yet.

The hope here is to have a handy prototyping environment for testing different indexing regimes. The code takes about 4-5 minutes to index AGROVOC on my MacBook, running under JRuby.
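The indexing loop itself is tiny. A stripped-down sketch (Lucene 2.x API, the SKOS wrapper from the jruby post below, and an example filename; not the exact skosdex code):

# Sketch of the indexing loop: one Lucene Document per SKOS concept, with the
# prefLabel in a "word" field and the concept URI in a "uri" field.
# Assumes the Lucene 2.x jar is on the classpath; not the exact skosdex code.
require 'java'
require 'src/jena_skos'

analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer.new
writer = org.apache.lucene.index.IndexWriter.new("skos-index", analyzer, true)

skos = SKOS.new("file:agrovoc.rdf") # example filename for the AGROVOC dump
skos.concepts.each_pair do |uri, concept|
  doc = org.apache.lucene.document.Document.new
  doc.add(org.apache.lucene.document.Field.new("word", concept.prefLabel,
    org.apache.lucene.document.Field::Store::YES,
    org.apache.lucene.document.Field::Index::TOKENIZED))
  doc.add(org.apache.lucene.document.Field.new("uri", uri,
    org.apache.lucene.document.Field::Store::YES,
    org.apache.lucene.document.Field::Index::TOKENIZED))
  writer.addDocument(doc)
end
writer.optimize
writer.close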

The data I’m using is a SKOS dump from the FAO Web site, post-processed with “grep -v” to skip the Farsi lines, due to a Unicode error. The transcript below comes from running Lucli, a handy command line tool for Lucene.

Next steps with indexing? Not sure. Probably make sure altLabel is handled. But I’m also curious about the possibility of including fields that pull in labels from nearby concepts, so they can be matched in weighted searches. It would be hard to evaluate the effectiveness, though.

lucli> search uri:"http://www.fao.org/aims/aos/agrovoc#c_47934"
Searching for: uri:"http www.fao.org aims aos agrovoc c_47934"
1 total matching documents
--------------------------------------
---------------- 1 score:1.0---------------------
word:Pteria hirundo
uri:http://www.fao.org/aims/aos/agrovoc#c_47934
#################################################
lucli> search word:"Leiocottus hirundo"
Searching for: word:"leiocottus hirundo"
1 total matching documents
--------------------------------------
---------------- 1 score:1.0---------------------
word:Leiocottus hirundo
uri:http://www.fao.org/aims/aos/agrovoc#c_45393
#################################################
lucli> search word:"hirundo"
Searching for: word:hirundo
2 total matching documents
--------------------------------------
---------------- 1 score:1.0---------------------
word:Pteria hirundo
uri:http://www.fao.org/aims/aos/agrovoc#c_47934
---------------- 2 score:1.0---------------------
word:Leiocottus hirundo
uri:http://www.fao.org/aims/aos/agrovoc#c_45393
#################################################

Skosdex: SKOS utilities via jruby

I just announced this on the public-esw-thes and public-rdf-ruby lists. I started to make a Ruby API for SKOS.

Example code snippet from the readme.txt (see that link for the corresponding output):

require "src/jena_skos"
s1 = SKOS.new("http://norman.walsh.name/knows/taxonomy")
s1.read("http://www.wasab.dk/morten/blog/archives/author/mortenf/skos.rdf" )
s1.read("file:samples/archives.rdf")
s1.concepts.each_pair do |url,c|
  puts "SKOS: #{url} label: #{c.prefLabel}"
end

c1 = s1.concepts["http://www.ukat.org.uk/thesaurus/concept/1366"] # Agronomy
puts "test concept is "+ c1 + " " + c1.prefLabel
c1.narrower do |uri|
  c2 = s1.concepts[uri]
  puts "\tnarrower: "+ c2 + " " + c2.prefLabel
  c2.narrower do |uri|
    c3 = s1.concepts[uri]
    puts "\t\tnarrower: "+ c3 + " " + c3.prefLabel
  end
end

The idea here is to have a lightweight OO API for SKOS, couched in terms of a network of linked “Concepts”, with broader and narrower relations. But this is backed by a full RDF API (in our case Jena, via jruby’s Java magic). Eventually, entire apps could be built at the SKOS API level. For now, anything beyond broader/narrower and prefLabel is hidden away in the RDF (and so you’d need to dip into the Jena API to get to this data).

The distinguishing feature is that it uses jruby (a Ruby implementation in pure Java). As such it can call on the full power of the Jena toolkit, which goes far beyond anything currently available in Ruby. At the moment it doesn’t do much; I just parse SKOS and make a tiny object model which exposes little more than prefLabel and broader/narrower.

I think it’s worth exploring because Ruby is rather nice for scripting, but lacks things like OWL reasoners and the general maturity of Java RDF/OWL tools (parsers, databases, etc.).

If you’re interested just to see how the Jena APIs look when called from jruby, see jena_skos.rb in svn. Excuse the mess.
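For a rough flavour of that (this is not the actual jena_skos.rb, just the general shape of Jena calls from jruby, printing prefLabels from a SKOS file):

# Not the real jena_skos.rb: just the shape of calling Jena from jruby.
# Assumes the Jena jars are on the classpath.
require 'java'

model = com.hp.hpl.jena.rdf.model.ModelFactory.createDefaultModel
model.read("http://norman.walsh.name/knows/taxonomy")

pref_label = model.createProperty("http://www.w3.org/2004/02/skos/core#prefLabel")
iter = model.listSubjectsWithProperty(pref_label)
while iter.hasNext
  concept = iter.nextResource
  puts "#{concept} prefLabel: #{concept.getProperty(pref_label).getString}"
end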

I’m interested to hear if anyone else has explored this topic. Obviously there is a lot more to SKOS than broader/narrower, so I’m very interested to find collaborators or at least a sanity check before taking this beyond a rough demo.

Plans – well, my main concern is nothing to do with Java or Ruby, … but to explore Lucene indexing of SKOS data. I am also very interested in the pragmatic question of where SKOS stops and RDFS/OWL starts, … and how exactly we bridge that gap. See Flickr for my most recent sketch of this landscape, where I revisit the idea of an “it” property (skos:it, foaf:it, …) that links things described in SKOS to “the thing itself”. I hope to load up enough overlapping SKOS data to get some practical experience with the tradeoffs.

The aim is query expansion, smarter tagging assistants, and the like. So the next step is probably to try building a Lucene index similar to the contrib/wordnet utility that ships with Java Lucene. That creates a Lucene index in which every “document” is really a word from WordNet, with text labels for its synonyms as indexed properties. I also hope to look at the use of SKOS + Lucene for “did you mean?” and auto-completion utilities. It’s also worth noting that Jena ships with LARQ, a Lucene-aware extension to ARQ, Jena’s SPARQL engine.
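As a first cut, “expansion” need only mean: look the user’s term up in the index and offer the matching concepts’ labels and URIs. A sketch against an index laid out as in the skosdex post above (Lucene 2.x API again; the index path is an example):

# Sketch: use the SKOS-derived index for crude query expansion /
# "did you mean?" suggestions. Lucene 2.x API; index layout as in the skosdex post.
require 'java'

searcher = org.apache.lucene.search.IndexSearcher.new("skos-index")
analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer.new
parser = org.apache.lucene.queryParser.QueryParser.new("word", analyzer)

def suggest(searcher, parser, term, max = 5)
  hits = searcher.search(parser.parse(term))
  (0...[hits.length, max].min).map do |i|
    doc = hits.doc(i)
    [doc.get("word"), doc.get("uri")]
  end
end

suggest(searcher, parser, "hirundo").each { |word, uri| puts "#{word}  #{uri}" }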