Querying Linked GeoData with R SPARQL client

Assuming you already have the R statistics toolkit installed, this should be easy.
Install Willem van Hage's R SPARQL client. I followed the instructions and it worked, although I also had to install the XML library, which was compiled and installed when I typed install.packages("XML", repos = "http://www.omegahat.org/R") within the R interpreter.
Yesterday I set up a simple SPARQL endpoint using Benjamin Nowack’s ARC2 and RDF data from the Ravensburg dataset. The data includes category information about many points of interest in a German town. Typing the following five statements into R is enough to show R consuming SPARQL results from the Web:
library(SPARQL)
endpoint = "http://foaf.tv/hypoid/sparql.php"
q = "PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rv: <http://www.wifo-ravensburg.de/rdf/semanticweb.rdf#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
SELECT ?poi ?l ?lon ?lat ?k
WHERE {
  GRAPH <http://www.heppresearch.com/dev/dump.rdf> {
    ?poi vcard:geo ?l .
    ?l vcard:longitude ?lon .
    ?l vcard:latitude ?lat .
    ?poi foaf:homepage ?hp .
    ?poi rv:kategorie ?k .
  }
}"
res <- SPARQL(endpoint, q)
pie(table(res$k))

This is the simplest thing that works to show the data flow. When combined with richer server-side support (eg. OWL tools, or spatial reasoning) and the capabilities of R plus its other extensions, there is a lot of potential here. A pie chart doesn’t capture all that, but it does show how to get started…
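Since the query also returns longitude and latitude bindings, a quick spatial plot is another natural next step. A minimal sketch, assuming the same res object as above (the lon/lat values may come back as strings, so they are coerced to numerics first):

# rough scatter plot of the points of interest, coloured by category
lon <- as.numeric(res$lon)
lat <- as.numeric(res$lat)
plot(lon, lat, col = as.factor(res$k), pch = 19,
     xlab = "longitude", ylab = "latitude",
     main = "Ravensburg points of interest by category")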

Note also that you can send any SPARQL query you like, so long as the server understands it and responds using W3C’s standard XML response. The R library doesn’t try to interpret the query, so you’re free to make use of any special features or experimental extensions understood by the server.
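For example, if the endpoint happens to support aggregates, the same client call works unchanged. A sketch using SPARQL 1.1-style syntax (a stricter or older server will simply reject it):

# counts per category, computed server-side; only works if the endpoint supports aggregates
q2 <- "PREFIX rv: <http://www.wifo-ravensburg.de/rdf/semanticweb.rdf#>
SELECT ?k (COUNT(?poi) AS ?n)
WHERE { ?poi rv:kategorie ?k }
GROUP BY ?k"
res2 <- SPARQL(endpoint, q2)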

Exploring Linked Data with Gremlin

Gremlin is a free Java/Groovy system for traversing graphs, including but not limited to RDF. This post is based on example code from Marko Rodriguez (@twarko) and the Gremlin wiki and mailing list. The test run below goes pretty slowly when run with 4 or 5 loops, since it uses the Web as its database, via entry-by-entry fetches. In this case it’s fetching from DBpedia, but I’ve run it with a few tweaks against Freebase happily too. The on-demand RDF is handled by the Linked Data Sail developed by Joshua Shinavier; the same thing would work directly against a database too. If you like Gremlin you’ll also like Joshua’s Ripple work (see screencast, code, wiki).

Why is this interesting? Don’t we already have SPARQL? And SPARQL 1.1 even has paths. I’d like to see a bit more convergence with SPARQL, but this is a different style of dealing with graph data. The most intriguing difference from SPARQL here is the ability to drop in Turing-complete fragments throughout the ‘query’; for example in the { closure block } shown below. I’m also, for hard-to-articulate reasons, reminded somehow of Apache’s Pig language. Although Pig doesn’t allow arbitrary script, it does encourage a pipeline perspective on data processing.

So in this example we start exploring the graph from one vertex, which we’ll call ‘fry’, representing Stephen Fry’s DBpedia entry. The idea is to collect up information about actors and their co-starring patterns as recorded in Wikipedia.

Here is the full setup code needed; it can be run interactively in the Gremlin commandline console. So that it runs quickly, we loop only twice.

g = new LinkedDataSailGraph(new MemoryStoreSailGraph())
fry = g.v('http://dbpedia.org/resource/Stephen_Fry')
g.addNamespace('wp', 'http://dbpedia.org/ontology/')
m = [:]

From here, everything else is in one line:
fry.in('wp:starring').out('wp:starring').groupCount(m).loop(3){ it.loops < 2 }

This corresponds to a series of steps, each of which maps to TinkerPop / Blueprints / Pipes API calls behind the scenes (an annotated restatement of the one-liner follows this list):

  • in (‘wp:starring’): from our initial vertex, representing Stephen Fry, we step to the vertices that point to us with a ‘wp:starring’ link
  • out (‘wp:starring’): from those vertices, we follow outgoing edges marked ‘wp:starring’ (including those back to Stephen Fry), taking us to the things that he and his co-stars starred in, i.e. TV shows and films
  • groupCount: we then call groupCount and pass it our bookkeeping hashtable, ‘m’. It increments a counter based on the ID of the current vertex or edge; as we revisit the same vertex later, the total counter for that entity goes up
  • loop: from this point, we go back three steps and repeat, for as long as the closure returns true; here { it.loops < 2 }, so the traversal runs round twice (this last is a closure; we can drop any code in here…)
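Spelled out with comments, this is the same pipeline again, restated purely for readability:

// the same traversal, annotated step by step:
//   in('wp:starring')        -> productions whose wp:starring edge points at Fry
//   out('wp:starring')       -> everyone those productions star, Fry included
//   groupCount(m)            -> bump a counter in the map m for each vertex visited
//   loop(3){ it.loops < 2 }  -> jump back three steps and repeat while loops < 2
fry.in('wp:starring').out('wp:starring').groupCount(m).loop(3){ it.loops < 2 }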

This maybe gives a flavour. See the Gremlin Wiki for the real goods. The first version of this post was verbose, as I had Gremlin step explicitly into graph edges, and back into vertices, each time. Gremlin allows edges to have properties, which is useful both for representing non-RDF data and for letting apps keep annotations on RDF triples. It also exposes ‘named graph’ URIs on each edge with an ‘ng’ property. You can step from a vertex into an edge with ‘inE’, ‘outE’ and other steps; again, check the wiki for details.
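As a rough sketch of those edge-level steps (same g and fry as above; the exact property-access idiom may vary between Gremlin releases):

// step into the edges themselves rather than hopping straight between vertices
fry.inE('wp:starring').outV()    // same result as fry.in('wp:starring'), via explicit edge objects
// each edge carries an 'ng' property naming the graph its triple came from
fry.inE('wp:starring').each{ e -> println e.getProperty('ng') }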

From an application and data perspective, the Gremlin system is interesting as it allows quantitatively minded graph explorations to be used alongside classically factual SPARQL. The results below show that it can dig out an actor’s co-stars (and then take account of their co-stars, and so on). This sort of neighbourhood exploration helps balance out the messiness of much Linked Data; rather than relying on explicitly asserted facts from the dataset, we can also add in derived data that comes from counting things expressed in dozens or hundreds of pages.

Once the Gremlin loops are finished, we can examine the state of our book-keeping object, ‘m’:

Back in the gremlin.sh commandline interface (effectively typing in Groovy) we can do this…

gremlin> m2 = m.sort{ a,b -> b.value <=> a.value }

==>v[http://dbpedia.org/resource/Stephen_Fry]=58
==>v[http://dbpedia.org/resource/Hugh_Laurie]=9
==>v[http://dbpedia.org/resource/Tony_Robinson]=6
==>v[http://dbpedia.org/resource/Rowan_Atkinson]=6
==>v[http://dbpedia.org/resource/Miranda_Richardson]=4
==>v[http://dbpedia.org/resource/Tim_McInnerny]=4
==>v[http://dbpedia.org/resource/Tony_Slattery]=3
==>v[http://dbpedia.org/resource/Emma_Thompson]=3
==>v[http://dbpedia.org/resource/Robbie_Coltrane]=3
==>v[http://dbpedia.org/resource/John_Lithgow]=2
==>v[http://dbpedia.org/resource/Emily_Watson]=2
==>v[http://dbpedia.org/resource/Colin_Firth]=2
==>v[http://dbpedia.org/resource/Sandi_Toksvig]=1
==>v[http://dbpedia.org/resource/John_Sessions]=1
==>v[http://dbpedia.org/resource/Greg_Proops]=1
==>v[http://dbpedia.org/resource/Paul_Merton]=1
==>v[http://dbpedia.org/resource/Mike_McShane]=1
==>v[http://dbpedia.org/resource/Ryan_Stiles]=1
==>v[http://dbpedia.org/resource/Colin_Mochrie]=1
==>v[http://dbpedia.org/resource/Josie_Lawrence]=1
[...]

Now how would this look if we looped around a few more times, i.e. if we re-ran our co-star traversal from each of the final vertices we settled on?
Here are the results from a longer run, looping 5 times instead of 2. The difference you see will depend upon the shape of the graph, the kinds of link you’re traversing, and so forth; and also, of course, on the nature of the things in the world that the graph describes:

==>v[http://dbpedia.org/resource/Stephen_Fry]=8160
==>v[http://dbpedia.org/resource/Hugh_Laurie]=3641
==>v[http://dbpedia.org/resource/Rowan_Atkinson]=2481
==>v[http://dbpedia.org/resource/Tony_Robinson]=2168
==>v[http://dbpedia.org/resource/Miranda_Richardson]=1791
==>v[http://dbpedia.org/resource/Tim_McInnerny]=1398
==>v[http://dbpedia.org/resource/Emma_Thompson]=1307
==>v[http://dbpedia.org/resource/Robbie_Coltrane]=1303
==>v[http://dbpedia.org/resource/Tony_Slattery]=911
==>v[http://dbpedia.org/resource/Colin_Firth]=854
==>v[http://dbpedia.org/resource/John_Lithgow]=732 [...]

Video Linking: Archives and Encyclopedias

This is a quick visual teaser for some archive.org-related work I’m doing with NoTube colleagues, and a collaboration with Kingsley Idehen on navigating it.

In NoTube we are trying to match people and TV content by using rich linked data representations of both. I love Archive.org and, with their help, have crawled an experimental subset of the video-related metadata for the Archive. I’ve also used a couple of other sources: Sean P. Aune’s list of 40 great movies, and the Wikipedia page listing US public domain films. I fixed, merged and scraped until I had a reasonable sample dataset for testing.

I wanted to test the Microsoft Pivot Viewer (a Silverlight control), and since OpenLink’s Virtuoso package now has built-in support, I got talking with Kingsley and we ended up with the following demo. Since not everyone has Silverlight, and this is just a rough prototype that may be offline, I’ve made a few screenshots. The real thing is very visual, with animated zooms and transitions, but screenshots give the basic idea.

Notes: the core dataset for now is just links between archive.org entries and Wikipedia/dbpedia pages. In NoTube we’ll also try the Lupedia, Zemanta and Reuters OpenCalais services on the Archive.org descriptions to see if they suggest other useful links and categories, as well as any other enrichment sources (delicious tags, machine learning) we can find. There is also more metadata from the Archive that we should be using.

This preview simply shows how one extra fact per archived item creates new opportunities for navigation, discovery and understanding. Note that the UI is in no way tuned to be TV, video or archive specific; rather, it just lets you explore a group of items by their ‘facets’ or common properties. It also reveals that the wiki data is rather chaotic; however, some fields (release date, runtime, director, star etc.) are reliably present. And of course, since the data is from Wikipedia, users can always fix the data.

You often hear Linked Data enthusiasts talk about data “silos” and the need to interconnect them. All that means here is that when collections are linked, improvements to information on one side of the link automatically bring improvements to the other. When a Wikipedia page about a director, actor or movie is improved, it now also improves our means of navigating Archive.org’s wonderful collection. And when someone contributes new video or new HTML5-powered players to the Archive, they’re enriching the encyclopedia too.

Archive.org films on a timeline by release date according to Wikipedia.

One thing to mention is that everything here comes from the Wikipedia data that is automatically extracted by DBpedia, and that currently the extractors are not working perfectly on all films, so it should get better in the future. I also added a lot of the image links myself, semi-automatically. For now, this navigation is based much more on simple facts than on topics; however, we do have Wikipedia categories for each film, director, studio etc., and these have been mapped to other category systems (formal and informal), so there are a lot of other directions to explore.

What else can we do? How about flipping the tiled barchart to organize by the film’s distributor, and constraining the ‘release date’ facet to the 1940s:

That’s nice. But remember that with Linked Data, you’re always dealing with a subset of data. It’s hard to know (and it’s hard for the interface designers to show us) when you have all the relevant data in hand. In this case, we can see what this is telling us about the videos currently available within the demo. But does it tell us anything interesting about all the films in the Archive? All the films in the world? Maybe a little, but interpretation is difficult.

Next: zoom in to a specific item. The legendary Plan 9 from Outer Space (wikipedia / dbpedia).

Note the HTML-based info panel on the right hand side. In this case it’s automatically generated by Virtuoso from properties of the item. A TV-oriented version would be less generic.

Finally, we can explore the collection by constraining the timeline to show us items organized according to release date, for some facet. Here we show it picking out the career of one Edward J. Kay, at least as far as he shows up as composer of items in this collection:

Now turning back to Wikipedia to learn about ‘Edward J. Kay’, I find he has no entry (beyond these passing mentions of his name) in the English Wikipedia, despite his work on The Ape Man, The Fatal Hour, and other films, although the German Wikipedia does honour him with an entry. I wonder whether this kind of Linked Data navigation will change the dynamics of the ‘deletionism’ debates at Wikipedia: firstly, by showing that structured data managed elsewhere can enrich the Wikipedia (and vice-versa), removing some pressure for a single wiki to cover everything; secondly, by providing a tool to stand further back from the data and view things in a larger context, a context in which, for example, Edward J. Kay’s achievements become clearer.

Much like Freebase Parallax, the Pivot viewer hints at a future in which we explore data by navigating from sets of things to other sets of things - like the set of films Edward J. Kay contributed to. Pivot doesn’t yet cover this, but it does very vividly present the potential for this kind of navigation, showing that navigation of films, TV shows and actors may be richer when it embraces more general mechanisms.

Disambiguating with DBpedia

Sketchy notes. Say you’re looking for an identifier for something, and you know it’s a company/organization, and you have a label “Woolworths”.

What can be done to choose amongst the results we find in DBpedia for this crude query?

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?x WHERE {
  ?x a <http://dbpedia.org/ontology/Organisation> ;
     rdfs:label ?l .
  FILTER(REGEX(?l, "Woolworths*"))
}
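Some obvious tweaks: anchor the match at the start of the label and restrict to English labels. A sketch of one possible refinement (heuristics only, not a recipe):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?x ?l WHERE {
  ?x a <http://dbpedia.org/ontology/Organisation> ;
     rdfs:label ?l .
  # anchor the match at the start of the label, and stick to English labels
  FILTER(REGEX(STR(?l), "^Woolworths") && LANGMATCHES(LANG(?l), "en"))
}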

More generally, are the tweaks and tricks needed to optimise this sort of disambiguation going to be cross-domain, or do we have to hand-craft them, case by case?

WordPress trust syndication revisited: F2F plugin

This is a followup to my Syndicating trust? Mediawiki, WordPress and OpenID post. I now have a simple implementation that exports data from WordPress: the F2F plugin. Also some experiments with consuming aggregates of this information from multiple sources.

FOAF has always had a bias towards describing social things that are shown rather than merely stated; this is particularly so in matters of trust. One way of showing basic confidence in others is by accepting their comments on your blog or Web site. F2F is an experiment in syndicating information about these kinds of everyday public events. With F2F, others can share and re-use this sort of information too, or deal with it in aggregate to spread the risk and bring more evidence into their trust-related decisions. Or they might just use it to find interesting people’s blogs.

OpenID is a technology that lets people authenticate by showing they control some URL. WordPress blogs that use the OpenID plugin slowly accumulate a catalogue of URLs when people leave comments that are approved or rejected. In my previous post I showed how I was using the list of approved OpenIDs from my blog to help configure the administrative groups on the FOAF wiki.

This may all raise more questions than it answers. What level of detail is appropriate? Are numbers useful, or just lists? In what circumstances is it sensible or risky to merge such data? Is there a reasonable use for both ‘accept’ lists and ‘unaccept’ lists? What can we do with a list of OpenID URLs once we’ve got it? How do we know when two bits of trust ‘evidence’ actually share a common source? How do we find this information from the homepage of a blog?

If you install the F2F plugin (and have been using the OpenID plugin long enough to have accumulated a database table of OpenIDs associated with submitted comments), you can experiment with this. Basically it will generate HTML in RDFa format describing a list of people. See the F2F Wiki page for details and examples.

The script is pretty raw, but today it all improved a fair bit with help from Ed Summers, Daniel Krech and Morten Frederiksen. Ed and Daniel helped me get started with consuming this RDFa and SPARQL in the latest version of the rdflib Python library. Morten rewrote my initial nasty hack, so that it uses WordPress Shortcodes instead of hardcoding a URL path. This means that any page containing a certain string – f2f in chunky brackets – will get the OpenID list added to it. I’ll try that now, right here in this post. If it works, you’ll get a list of URLs below. Also thanks to Gerald Oskoboiny for discussions on this and reputation-related aggregation ideas; see his page on reputation and trust for lots more related ideas and sites. See also Peter Williams’ feedback on the foaf-dev list.
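On the consuming side, a minimal sketch of the rdflib route Ed and Daniel helped with; this assumes an rdflib install whose RDFa parser plugin is available, and uses my blog URL purely as an illustrative target:

import rdflib

g = rdflib.Graph()
# pull in the RDFa embedded in the blog page (needs rdflib's RDFa parser plugin)
g.parse("http://danbri.org/words/", format="rdfa")

# dump whatever triples the F2F markup produced; a real consumer would run a SPARQL
# query over g for the specific accept-list properties (see the F2F wiki page)
for s, p, o in g:
    print(s, p, o)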

Next steps? I’d be happy to have a few more installations of this, to get some testbed data. Ideally from an overlapping community so the datasets are linked, though that’s not essential. Ed has a copy installed currently too. I’ll also update the scripts I use to manage the FOAF MediaWiki admin groups, to load data from RDFa blogs; mine and others if people volunteer relevant data. It would be great to have exports from other software too, eg. Drupal or MediaWiki.

Comment accept list for http://danbri.org/words

Quick clarification on SPARQL extensions and “Lock-in”

It’s clear from discussion bouncing around IRC, Twitter, Skype and elsewhere that “Lock-in” isn’t a phrase to use lightly.

So I post this to make myself absolutely clear. A few days ago I mentioned in IRC a concern that newcomers to SPARQL and RDF databases might not appreciate which SPARQL extensions are widely implemented, and which are the specialist offerings of the system they happen to be using. I mentioned OpenLink’s Virtuoso in particular as a SPARQL implementation that had a rich and powerful set of extensions.

Since it seems there is some risk I might be misinterpreted as suggesting OpenLink are actively trying to “do a Microsoft” and trap users in some proprietary pseudo-SPARQL, I’ll state what I took to be obvious background knowledge: OpenLink is a company that owes its success to the promotion of cross-vendor database portability; they have been tireless advocates of a standards-based Semantic Web, and they’re active in proposing extensions to W3C for standardisation. So – no criticism of OpenLink intended. None at all.

All I think we need here are a few utilities that help developers understand the nature of the various SPARQL dialects and the potential costs/benefits of using them. Perhaps an online validator, alongside those for RDF/XML, RDFa, Turtle etc. Such a validator might usefully list the extensions used in some query, and give pointers (perhaps into a wiki) where the status of the various extension constructs can be discussed and documented.

Since SPARQL is such a young language, it lacks a lot of things that are taken for granted in the SQL world, and so using rich custom extensions when available is for many developers a sensible choice. My only concern is that it must be a choice, and one entered into consciously.
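To make that concrete, compare a portable query with the same lookup leaning on a vendor extension (Virtuoso’s free-text predicate is written here from memory, so check the Virtuoso docs for the exact form); both are reasonable, but only the first travels to other engines:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# portable: plain SPARQL, should run on any conformant engine
SELECT ?x WHERE { ?x rdfs:label ?l . FILTER(REGEX(STR(?l), "Woolworths")) }

# extension: Virtuoso's bif:contains free-text predicate; typically much faster there,
# but a different server will reject the query outright
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?x WHERE { ?x rdfs:label ?l . ?l bif:contains "Woolworths" }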

Rick Jelliffe on XML Schema

From the TAG list:

XML Schemas is like using a Swiss Army knife to cook with. Most Asian kitchens get by with a handful of simple tools: chopsticks, hatchet, a good knife, perhaps even a spoon. But the logic of  the XSD WG is “Oh, the French need to make quenelles, we must have a quenelling spoon as a grave matter of Internationalization because it is not our business to judge what people need… as long it is more stuff.”    So XSD 1.1 welds another Swiss Army knife onto the existing one, so that no kitchen should suffer without a quenelling spoon.

See also earlier comments on the Schema Experience Workshop from W3C.

So tool-makers blame users for generating non-standard schemas, and users blame the spec for being too difficult to know whether their schemas are standard or not, and spec makers blame tool makers for not implementing the spec properly. Who will free us from this cycle of sin and death?

[...] The only way that XML Schemas can be refactored is with a different core XML Schemas working group. My current expectation is that a lot of nothing will happen until XQuery/XSLT2 becomes seen as a more central technology than XML Schemas; the goal will then be how to support XQuery most minimally.

XSD doesn’t trouble me as much as it troubles Rick, but I have long sympathised with the approach he advocates with Schematron. The RDF equivalent of this is the approach Libby and I called “Schemarama”: expressing constraints against RDF instance data using queries. See the original 2001 demo using SquishQL, and a later reworking by Alistair Miles using SPARQL (currently offline?). Recent work from the OWL experts at Clark & Parsia (blog post; another blog post) is heading in the same direction. I wonder whether Rick’s observation about XML applies to RDF too, and whether at some point SPARQL querying facilities will be so ubiquitous in RDF tools that it becomes second nature to apply them to data-checking tasks too…?
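For a flavour of the query-based approach, here is a Schemarama-style check sketched in SPARQL (the 2001 demo used SquishQL; the particular FOAF constraint is just an illustration). Each row returned is a violation to report, not an answer to a question:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
# constraint: every foaf:Person we describe should have a foaf:name
SELECT ?person
WHERE {
  ?person a foaf:Person .
  OPTIONAL { ?person foaf:name ?name }
  FILTER(!BOUND(?name))
}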

Update: see also SpinRDF from Holger & co. at Top Quadrant

Skosdex: SKOS utilities via jruby

I just announced this on the public-esw-thes and public-rdf-ruby lists. I started to make a Ruby API for SKOS.

Example code snippet from the readme.txt (see that link for the corresponding output):

require "src/jena_skos"
s1 = SKOS.new("http://norman.walsh.name/knows/taxonomy")
s1.read("http://www.wasab.dk/morten/blog/archives/author/mortenf/skos.rdf" )
s1.read("file:samples/archives.rdf")
s1.concepts.each_pair do |url,c|
  puts "SKOS: #{url} label: #{c.prefLabel}"
end

c1 = s1.concepts["http://www.ukat.org.uk/thesaurus/concept/1366"] # Agronomy
puts "test concept is "+ c1 + " " + c1.prefLabel
c1.narrower do |uri|
  c2 = s1.concepts[uri]
  puts "\tnarrower: "+ c2 + " " + c2.prefLabel
  c2.narrower do |uri|
    c3 = s1.concepts[uri]
    puts "\t\tnarrower: "+ c3 + " " + c3.prefLabel
  end
end

The idea here is to have a lightweight OO API for SKOS, couched in terms of a network of linked “Concepts” with broader and narrower relations, but backed by a full RDF API (in our case Jena, via jruby’s Java integration). Eventually, entire apps could be built at the SKOS API level. For now, anything beyond broader/narrower and prefLabel is hidden away in the RDF (and so you’d need to dip into the Jena API to get to this data).
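As a tiny sketch of what ‘apps at the SKOS API level’ might look like: naive query expansion, using only the concepts / prefLabel / narrower calls from the readme snippet above (the UKAT ‘Agronomy’ concept is reused purely as an example):

# gather a concept's label plus the labels of its immediate narrower concepts,
# e.g. to broaden a search query (assumes the s1 object built as in the readme)
def expansion_terms(skos, uri)
  concept = skos.concepts[uri]
  terms = [concept.prefLabel]
  concept.narrower do |narrower_uri|
    terms << skos.concepts[narrower_uri].prefLabel
  end
  terms
end

puts expansion_terms(s1, "http://www.ukat.org.uk/thesaurus/concept/1366").join(", ")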

The distinguishing feature is that it uses jruby (a Ruby implementation in pure Java). As such it can call on the full powers of the Jena toolkit, which go far beyond anything currently available in Ruby. At the moment it doesn’t do much: I just parse SKOS and make a tiny object model which exposes little more than prefLabel and broader/narrower.

I think it’s worth exploring because Ruby is rather nice for scripting, but lacks things like OWL reasoners and the general maturity of Java RDF/OWL tools (parsers, databases, etc.).

If you’re interested just to see how the Jena APIs look when called from jruby, see jena_skos.rb in svn. Excuse the mess.

I’m interested to hear if anyone else has explored this topic. Obviously there is a lot more to SKOS than broader/narrower, so I’m very interested to find collaborators or at least a sanity check before taking this beyond a rough demo.

Plans – well, my main concern is nothing to do with Java or Ruby; it is to explore Lucene indexing of SKOS data. I am also very interested in the pragmatic question of where SKOS stops and RDFS/OWL starts, and how exactly we bridge that gap. See flickr for my most recent sketch of this landscape, where I revisit the idea of an “it” property (skos:it, foaf:it, …) that links things described in SKOS to “the thing itself”. I hope to load up enough overlapping SKOS data to get some practical experience with the tradeoffs.

This could feed query expansion, smarter tagging assistants, and so on. So the next step is probably to try building a Lucene index similar to the contrib/wordnet utility that ships with Java Lucene. That utility creates a Lucene index in which every “document” is really a word from WordNet, with text labels for its synonyms as indexed properties. I also hope to look at the use of SKOS + Lucene for “did you mean?” and auto-completion utilities. It’s also worth noting that Jena ships with LARQ, a Lucene-aware extension to ARQ, Jena’s SPARQL engine.

Mozilla Ubiquity

There are some interesting things going on at Mozilla Labs. Yesterday, Ubiquity was all over the mailing lists. You can think of it as “what the Humanized folks did next”, or as a commandline for the Web, or as a Webbier sibling to QuickSilver, the MacOSX utility. I prefer to think of it as the Mozilla add-on that distracted me all day. Ubiquity continues Mozilla’s exploration of the potential UI uses of its “awesome bar” (aka Location bar). Ubiquity is invoked on my Mac with alt-space, at which point it’ll enthusiastically try to autocomplete a verb-centric Webby task from whatever I type. It does this by consulting a pile of built-in and community-provided Javascript functions, which have access to the Web, your browser (hello, widget security fans)… and it also has access to UI, in terms of an overlaid preview window, as well as a context menu that can actually be genuinely contextual, ie. potentially sensitive to microformat and RDFa markup.

So it might help to think of Ubiquity as a cross between The Hobbit, GreaseMonkey, bookmarklets, and Mozilla’s earlier forms of packaged add-on. Ok, well, it’s not very Hobbit; I just wanted an excuse for this screen grab. But it is about natural language interfaces to complex Webby datasources and services.

The basic idea here is that commands (triggered by some keyword) can be published in the Web as links to simple Javascript files that can be single-click added (without need for browser restart) by anyone trusting enough to add the code to their browser. Social/trust layers to help people avoid bad addons are in the works too.
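To give a feel for what such a command looks like, here is roughly the shape of an early Ubiquity command (written from memory of the 0.1-era API, so treat the details as approximate; the command and its lookup URL are made-up placeholders):

// toy command: preview and open a lookup for whatever text the user types or selects
CmdUtils.CreateCommand({
  name: "lookup-person",
  takes: {"name": noun_arb_text},          // free-text argument
  preview: function(pblock, input) {
    // the preview pane is plain HTML we can write into
    pblock.innerHTML = "Will look up <b>" + input.text + "</b>";
  },
  execute: function(input) {
    Utils.openUrlInBrowser("http://example.org/lookup?q=" +
                           encodeURIComponent(input.text));
  }
});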

I spent yesterday playing. There are some rough edges, but this is fun stuff for sure. The emphasis is on verbs, hence on doing, rather than solely on lookups, query and data access. Coupled with the dependency on third party Javascript, this is going to need some serious security attention. But but but… it’s so much fun to use and develop for. Something will shake out security-wise. Even if Ubiquity commands are only shared amongst trusting power users who have signed each other’s PGP keys, I think it’ll still have an important niche.

What did I make? A kind of stalk-a-tron, a FOAF lookup tool. It currently only consults Google’s Social Graph API, an experimental service built from all the public FOAF and XFN on the Web, plus some logic to figure out which account pages are held by the same person. My current demo simply retrieves associated URLs and photos, and displays them overlaid on the current page. If you can’t get it working via the Ubiquity auto-subscribe feature, try adding it by pasting the raw Javascript into the command-editor screen. See also the ‘sindice-term’ lookup tool from Michael Hausenblas. It should be fun seeing how efforts like Bengee’s SPARQLScript work can be plugged in here, too.