Exploring Linked Data with Gremlin

Gremlin is a free Java/Groovy system for traversing graphs, including but not limited to RDF. This post is based on example code from Marko Rodriguez (@twarko) and the Gremlin wiki and mailing list. The test run below goes pretty slowly when run with 4 or 5 loops, since it uses the Web as its database, via entry-by-entry fetches. In this case it’s fetching from DBpedia, but I’ve run it with a few tweaks against Freebase happily too. The on-demand RDF is handled by the Linked Data Sail developed by Joshua Shinavier; the same thing would work directly against a database too. If you like Gremlin you’ll also like Joshua’s Ripple work (see screencast, code, wiki).

Why is this interesting? Don’t we already have SPARQL? And SPARQL 1.1 even has paths. I’d like to see a bit more convergence with SPARQL, but this is a different style of dealing with graph data. The most intriguing difference from SPARQL here is the ability to drop in Turing-complete fragments throughout the ‘query’; for example in the { closure block } shown below. I’m also, for hard-to-articulate reasons, reminded somehow of Apache’s Pig language. Although Pig doesn’t allow arbitrary script, it does encourage a pipeline perspective on data processing.

So in this example we start exploring the graph from one vertex, we’ll call it ‘fry’, representing Stephen Fry’s dbpedia entry. The idea is to collect up information about actors and their co-starring patterns as recorded in Wikipedia.

Here is the full setup code needed; it can be run interactively in the Gremlin commandline console. So that it runs quickly, we loop only twice.

g = new LinkedDataSailGraph(new MemoryStoreSailGraph())
fry = g.v('http://dbpedia.org/resource/Stephen_Fry')
g.addNamespace('wp', 'http://dbpedia.org/ontology/')
m = [:]

From here, everything else is in one line:
fry.in('wp:starring').out('wp:starring').groupCount(m).loop(3){it.loops < 2}

This corresponds to a series of steps (which map to TinkerPop / Blueprints / Pipes API calls behind the scenes). I’ve marked the steps in bold here:

  • in: ‘wp:starring’: from our initial vertex, representing Stephen Fry, we step to vertices that point to us with a ‘wp:starring’ link
  • out: from those vertices, we follow outgoing edges marked ‘wp:starring’ (including those back to Stephen Fry), taking us to things that he and his co-stars starred in, i.e. TV shows and films.
  • we then call groupCount and pass it our bookkeeping hashtable, ‘m’. It increments a counter based on the ID of the current vertex or edge. As we revisit the same vertex later, the total counter for that entity goes up (a conceptual sketch of this bookkeeping follows this list).
  • from this point, we then go back 3 steps and recurse, as controlled by the closure argument, e.g. the "{ it.loops < 2 }" in the query above (this last is a closure; we can drop any code in here…)
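
For a feel of what groupCount is doing, here is a conceptual sketch in plain Java (not the actual TinkerPop implementation): each time the traversal lands on a vertex, a counter keyed by that vertex’s ID is bumped, so frequently revisited co-stars accumulate large counts.

import java.util.HashMap;
import java.util.Map;

public class GroupCountSketch {
    // plays the role of the Gremlin bookkeeping map 'm'
    private final Map<String, Integer> m = new HashMap<>();

    // called each time the traversal visits a vertex, keyed by its ID (here a URI)
    public void groupCount(String vertexId) {
        m.merge(vertexId, 1, Integer::sum);
    }

    public static void main(String[] args) {
        GroupCountSketch counts = new GroupCountSketch();
        // simulate revisiting the same co-star vertex across loop iterations
        counts.groupCount("http://dbpedia.org/resource/Hugh_Laurie");
        counts.groupCount("http://dbpedia.org/resource/Hugh_Laurie");
        counts.groupCount("http://dbpedia.org/resource/Tony_Robinson");
        counts.m.forEach((uri, n) -> System.out.println(uri + " = " + n));
    }
}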

This maybe gives a flavour. See the Gremlin Wiki for the real goods. The first version of this post was verbose, as I had Gremlin step explicitly into graph edges, and back into vertices, each time. Gremlin allows edges to have properties, which is useful both for representing non-RDF data and for letting apps keep annotations on RDF triples. It also exposes ‘named graph’ URIs on each edge with an ‘ng’ property. You can step from a vertex into an edge with ‘inE’, ‘outE’ and other steps; again, check the wiki for details.

From an application and data perspective, the Gremlin system is interesting as it allows quantitatively minded graph explorations to be used alongside classically factual SPARQL. The results below show that it can dig out an actor’s co-stars (and then take account of their co-stars, and so on). This sort of neighbourhood exploration helps balance out the messiness of much Linked Data; rather than relying on explicitly asserted facts from the dataset, we can also add in derived data that comes from counting things expressed in dozens or hundreds of pages.

Once the Gremlin loops are finished, we can examine the state of our book-keeping object, ‘m’:

Back in the gremlin.sh commandline interface (effectively typing in Groovy) we can do this…

gremlin> m2 = m.sort{ a,b -> b.value <=> a.value }

==>v[http://dbpedia.org/resource/Stephen_Fry]=58
==>v[http://dbpedia.org/resource/Hugh_Laurie]=9
==>v[http://dbpedia.org/resource/Tony_Robinson]=6
==>v[http://dbpedia.org/resource/Rowan_Atkinson]=6
==>v[http://dbpedia.org/resource/Miranda_Richardson]=4
==>v[http://dbpedia.org/resource/Tim_McInnerny]=4
==>v[http://dbpedia.org/resource/Tony_Slattery]=3
==>v[http://dbpedia.org/resource/Emma_Thompson]=3
==>v[http://dbpedia.org/resource/Robbie_Coltrane]=3
==>v[http://dbpedia.org/resource/John_Lithgow]=2
==>v[http://dbpedia.org/resource/Emily_Watson]=2
==>v[http://dbpedia.org/resource/Colin_Firth]=2
==>v[http://dbpedia.org/resource/Sandi_Toksvig]=1
==>v[http://dbpedia.org/resource/John_Sessions]=1
==>v[http://dbpedia.org/resource/Greg_Proops]=1
==>v[http://dbpedia.org/resource/Paul_Merton]=1
==>v[http://dbpedia.org/resource/Mike_McShane]=1
==>v[http://dbpedia.org/resource/Ryan_Stiles]=1
==>v[http://dbpedia.org/resource/Colin_Mochrie]=1
==>v[http://dbpedia.org/resource/Josie_Lawrence]=1
[...]

Now how would this look if we looped around a few more times? i.e. re-ran our co-star traversal from each of the final vertices we settled on?
Here are the results from a longer run. The difference you see will depend upon the shape of the graph, the kind of link types you’re traversing, and so forth. And also, of course, on the nature of the things in the world that the graph describes. Here are the Gremlin results when we loop 5 times instead of 2:

==>v[http://dbpedia.org/resource/Stephen_Fry]=8160
==>v[http://dbpedia.org/resource/Hugh_Laurie]=3641
==>v[http://dbpedia.org/resource/Rowan_Atkinson]=2481
==>v[http://dbpedia.org/resource/Tony_Robinson]=2168
==>v[http://dbpedia.org/resource/Miranda_Richardson]=1791
==>v[http://dbpedia.org/resource/Tim_McInnerny]=1398
==>v[http://dbpedia.org/resource/Emma_Thompson]=1307
==>v[http://dbpedia.org/resource/Robbie_Coltrane]=1303
==>v[http://dbpedia.org/resource/Tony_Slattery]=911
==>v[http://dbpedia.org/resource/Colin_Firth]=854
==>v[http://dbpedia.org/resource/John_Lithgow]=732 [...]

Skosdex: SKOS utilities via jruby

I just announced this on the public-esw-thes and public-rdf-ruby lists. I started to make a Ruby API for SKOS.

Example code snippet from the readme.txt (see that link for the corresponding output):

require "src/jena_skos"
s1 = SKOS.new("http://norman.walsh.name/knows/taxonomy")
s1.read("http://www.wasab.dk/morten/blog/archives/author/mortenf/skos.rdf" )
s1.read("file:samples/archives.rdf")
s1.concepts.each_pair do |url,c|
  puts "SKOS: #{url} label: #{c.prefLabel}"
end

c1 = s1.concepts["http://www.ukat.org.uk/thesaurus/concept/1366"] # Agronomy
puts "test concept is "+ c1 + " " + c1.prefLabel
c1.narrower do |uri|
  c2 = s1.concepts[uri]
  puts "\tnarrower: "+ c2 + " " + c2.prefLabel
  c2.narrower do |uri|
    c3 = s1.concepts[uri]
    puts "\t\tnarrower: "+ c3 + " " + c3.prefLabel
  end
end

The idea here is to have a lightweight OO API for SKOS, couched in terms of a network of linked “Concepts” with broader and narrower relations, but backed by a full RDF API (in our case Jena, reached from Ruby via jruby). Eventually, entire apps could be built at the SKOS API level. For now, anything beyond broader/narrower and prefLabel is hidden away in the RDF (and so you’d need to dip into the Jena API to get to this data).
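
As a rough illustration of what dipping into Jena looks like (plain Java here rather than jruby, and against the Jena 2.x-era package names), listing skos:prefLabel triples from a loaded model is something like this:

import com.hp.hpl.jena.rdf.model.*;

public class SkosLabels {
    public static void main(String[] args) {
        String SKOS = "http://www.w3.org/2004/02/skos/core#";
        Model model = ModelFactory.createDefaultModel();
        // any SKOS RDF source will do; this URL is from the example above
        model.read("http://norman.walsh.name/knows/taxonomy");

        Property prefLabel = model.createProperty(SKOS, "prefLabel");
        StmtIterator it = model.listStatements(null, prefLabel, (RDFNode) null);
        while (it.hasNext()) {
            Statement s = it.nextStatement();
            System.out.println(s.getSubject() + "  " + s.getObject());
        }
    }
}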

The distinguishing feature is that it uses jruby (a Ruby implementation in pure Java). As such it can call on the full powers of the Jena toolkit, which go far beyond anything available currently in Ruby. At the moment it doesn’t do much: I just parse SKOS and make a tiny object model which exposes little more than prefLabel and broader/narrower.

I think it’s worth exploring because Ruby is rather nice for scripting, but lacks things like OWL reasoners and the general maturity of Java RDF/OWL tools (parsers, databases, etc.).

If you’re interested just to see how the Jena APIs look when called from jruby, see jena_skos.rb in svn. Excuse the mess.

I’m interested to hear if anyone else has explored this topic. Obviously there is a lot more to SKOS than broader/narrower, so I’m very interested to find collaborators or at least a sanity check before taking this beyond a rough demo.

Plans – well my main concern is nothing to do with java or ruby, … but to explore Lucene indexing of SKOS data. I am also very interested in the pragmatic question of where SKOS stops and RDFS/OWL starts, … and how exactly we bridge that gap. See flickr for my most recent sketch of this landscape, where I revisit the idea of an “it” property (skos:it, foaf:it, …) that links things described in SKOS to “the thing itself”. I hope to load up enough overlapping SKOS data to get some practical experience with the tradeoffs.

This could feed into query expansion, smarter tagging assistants, and so on. So the next step is probably to try building a Lucene index similar to the contrib/wordnet utility that ships with Java Lucene. That utility creates a Lucene index in which every “document” is really a word from WordNet, with text labels for its synonyms as indexed properties. I also hope to look at the use of SKOS + Lucene for “did you mean?” and auto-completion utilities. It’s also worth noting that Jena ships with LARQ, a Lucene-aware extension to ARQ, Jena’s SPARQL engine.
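
As a sketch of the Lucene side of this (field names are my own invention, and I’m assuming a reasonably recent Lucene API rather than the one contrib/wordnet was written against), each SKOS concept could become a Lucene “document” carrying its URI and labels:

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SkosIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("skos-index"));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        // one "document" per SKOS concept: URI stored verbatim, labels analysed for search
        Document doc = new Document();
        doc.add(new StringField("uri", "http://www.ukat.org.uk/thesaurus/concept/1366", Field.Store.YES));
        doc.add(new TextField("prefLabel", "Agronomy", Field.Store.YES));
        // altLabels, and perhaps labels of broader/narrower concepts, would be added the same way
        writer.addDocument(doc);

        writer.close();
    }
}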

Beautiful plumage: Topic Maps Not Dead Yet

Echoing recent discussion of Semantic Web “Killer Apps”, an “are Topic Maps dead?” thread has been running on the topicmaps mailing list. Signs of life offered include www.fuzzzy.com (‘Collaborative, semantic and democratic social bookmarking’, Topic Maps meet social networking; featured tag: ‘topic maps’) and a longer list from Are Gulbrandsen, who suggests a predictable hype-cycle dropoff is occurring, as well as a migration of discussions from email into the blog world. For which, see the topicmaps planet aggregator, and through which I indirectly find Steve Pepper’s blog and an interesting post on how TMs relate to RDF, OWL and the Semantic Web (though I’d have hoped for some mention of SKOS too).

Are Gulbrandsen also cites NZETC (the New Zealand Electronic Text Centre), winner of the Topic Maps Application of the Year award at the Topic Maps 2008 conference; see Conal Tuohy’s presentation on Topic Maps for Cultural Heritage Collections (slides in PDF). On NZETC’s work: “It may not look that interesting to many people used to flashy web 2.0 sites, but to anybody who have been looking at library systems it’s a paradigm shift”.

Other Topic Map work highlighted: RAMline (Royal Academy of Music rewriting musical history). “A long-term research project into the mapping of three axes of musical time: the historical, the functional, and musical time itself.”; David Weinberger blogged about this work recently. Also MIPS / Institute for Bioinformatics and Systems Biology who “attempt to explain the complexity of life with Topic Maps” (see presentation from Volker Stümpflen (PDF); also a TMRA’07 talk).

Finally, pointers to opensource developer tools: Ruby Topic Maps and Wandora (Java/GPL), an extraction/mapping and publishing system which amongst other things can import RDF.

Topic Maps are clearly not dead, and the Web’s a richer environment because of this. They may not have set the world on fire but people are finding value in the specs and tools, while also exploring interop with RDF and other related technologies. My hunch is that we’ll continue to see a slow drift towards the use of RDF/OWL plus SKOS for apps that might otherwise have been addressed using TopicMaps, and a continued pragmatism from tool and app developers who see all these things as ways to solve problems, rather than as ends in themselves.

Just as with RDFa, GRDDL and Microformats, it is good and healthy for the Web community to be exploring multiple similar strands of activity. We’re smart enough to be able to flow data across these divides when needed, and having only a single technology stack is, I think, intellectually limiting, socially impractical, and technologically short-sighted.

JQbus: social graph query with XMPP/SPARQL

Righto, it’s about time I wrote this one up. One of my last deeds at W3C before leaving at the end of 2005 was to begin the specification of an XMPP binding of the SPARQL querying protocol. For the acronym averse, a quick recap. XMPP is the name the IETF give to the Jabber messaging technology. And SPARQL is W3C’s RDF-based approach to querying mixed-up Web data. SPARQL defines a textual query language, an XML result-set format, and a JSON version for good measure. There is also a protocol for interacting with SPARQL databases; this defines an abstract interface, and a binding to HTTP. There is as yet no official binding to XMPP/Jabber, and existing explorations are flawed. But, as I’ll argue here, the work is well worth completing.

jqbus diagram

So what do we have so far? Back in 2005, I was working in Java, Chris Schmidt in Python, and Steve Harris in Perl. Chris has a nice writeup of one of the original variants I’d proposed, which came out of my discussions with Peter Saint-Andre. Chris also beat me in the race to have a working implementation, though I’ll attribute this to Python’s advantages over Java ;)

I won’t get bogged down in the protocol details here, except to note that Peter advised us to use IQ stanzas. The existing work offers a few slight variants on the idea of sending a SPARQL query in one IQ packet and returning all the results within another, but this isn’t quite deployable as-is. When the result set is too big, we can run into practical (rather than spec-mandated) limits at the server-to-server layer. For example, Peter mentioned that jabber.org had a 65k packet limit. At SGFoo last week, someone suggested sending the results as an attachment instead; apparently this is one of the uncountably many extension specs produced by the energetic Jabber community. The 2005 work was also somewhat partial, and didn’t work out the detail of having a full binding (eg. dealing with default graphs, named graphs etc).

That said, I think we’re onto something good. I’ll talk through the Java stuff I worked on, since I know it best. The code uses Ignite Realtime’s Smack API. I have published rough Java code that can communicate with instances of itself across Jabber. This was last updated July 2007, when I fixed it up to use more recent versions of Smack and Jena. I forget if the code to parse out query results from the responses was completed, but it does at least send SPARQL XML results back through the XMPP network.
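
For flavour, here is a from-memory sketch of the IQ-stanza idea against a Smack 3.x-style API; the namespace for the SPARQL payload is made up for illustration, and this isn’t the actual JQbus code:

import org.jivesoftware.smack.XMPPConnection;
import org.jivesoftware.smack.packet.IQ;

public class SparqlIQExample {

    // an IQ packet whose child element carries a SPARQL query string
    static class SparqlQueryIQ extends IQ {
        private final String query;

        SparqlQueryIQ(String query) {
            this.query = query;
        }

        @Override
        public String getChildElementXML() {
            // namespace is illustrative only, not a registered extension
            return "<query xmlns='http://example.org/protocol/sparql-xmpp'>"
                    + "<![CDATA[" + query + "]]></query>";
        }
    }

    public static void main(String[] args) throws Exception {
        XMPPConnection conn = new XMPPConnection("example.org");
        conn.connect();
        conn.login("alice", "secret");

        SparqlQueryIQ iq = new SparqlQueryIQ(
                "SELECT ?name WHERE { ?x <http://xmlns.com/foaf/0.1/name> ?name } LIMIT 10");
        iq.setType(IQ.Type.GET);
        iq.setTo("bob@example.org/sparql");
        conn.sendPacket(iq);
        // a listener on the other side would parse out the query, run it,
        // and reply with a result IQ containing the SPARQL XML results
    }
}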

sparql jabber interaction

So why is this interesting?

  • SPARQLing over XMPP can cut through firewalls/NAT and talk to data on the desktop
  • SPARQLing over XMPP happens in a social environment; queries are sent from and to Jabber addresses, while roster information is available which could be used for access control at various levels of granularity
  • XMPP is well suited for async interaction; stored queries could return results days or weeks later (eg. job search)
  • The DISO project is integrating some PHP XMPP code with WordPress; SparqlPress is doing the same with SPARQL

Both SPARQL and XMPP have mechanisms for batched results; it isn’t clear which, if either, to use here.

XMPP also has some service discovery mechanisms; I hope we’ll wire up a way to inspect each party on a buddylist roster, to see who has SPARQL data available. I made a diagram of this last summer, but no code to go with it yet. There is also much work yet to do on access control systems for SPARQL data, eg. using oauth. It is far from clear how to integrate SPARQL-wide ideas on that with the specific possibilities offered within an XMPP binding. One idea is for SPARQL-defined FOAF groups to be used to manage buddylist rosters, “friend groups”.
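
Here is a similarly hedged sketch of the roster/service-discovery idea, again assuming a Smack 3.x-style API; the feature URI a SPARQL-capable party would advertise is invented for illustration:

import org.jivesoftware.smack.Roster;
import org.jivesoftware.smack.RosterEntry;
import org.jivesoftware.smack.XMPPConnection;
import org.jivesoftware.smackx.ServiceDiscoveryManager;
import org.jivesoftware.smackx.packet.DiscoverInfo;

public class SparqlBuddyFinder {
    // hypothetical feature URI a SPARQL-capable client would advertise via disco#info
    static final String SPARQL_FEATURE = "http://example.org/protocol/sparql-xmpp";

    public static void main(String[] args) throws Exception {
        XMPPConnection conn = new XMPPConnection("example.org");
        conn.connect();
        conn.login("alice", "secret");

        ServiceDiscoveryManager disco = ServiceDiscoveryManager.getInstanceFor(conn);
        Roster roster = conn.getRoster();
        for (RosterEntry entry : roster.getEntries()) {
            // ask each roster contact what features it supports
            DiscoverInfo info = disco.discoverInfo(entry.getUser());
            if (info.containsFeature(SPARQL_FEATURE)) {
                System.out.println(entry.getUser() + " has SPARQL data available");
            }
        }
    }
}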

Where are we with code? I have a longer page in FOAF SVN for the Jqbus Java stuff, and a variant on this writeup (includes better alt text for the images and more detail). The Java code is available for download. Chris’s Python code is still up on his site. I doubt any of these can currently talk to each other properly, but they show how to deal with XMPP in different languages, which is useful in itself. For Perl people, I’ve uploaded a copy of Steve’s code.

The Java stuff has a nice GUI for debugging, thanks to Smack. I just tried a copy. Basically I can run a client and a server instance from the same filetree, passing them my LiveJournal and Google Talk jabber account details. The screenshot here shows the client on the left with the XML-encoded SPARQL results, and the server on the right displaying the query that arrived. That’s about it really. Nothing else ought to be that different from normal SPARQL querying, except that it is being done in an infrastructure that is more socially-grounded and decentralised than the classic HTTP Web service model.

JQbus debug

Venn diagrams for Groups UI

Venn groups diagram

This is the result of feeding a relatively small list of groups and their members to the VennMaster java tool. I’ve been looking for swooshy automatic layout tools that might help with interactive visualisation of ‘social graph’ data where people can be clustered by their membership of various groups. I also wanted to explore the possibility of using such a tool as a way of authoring filters from raw evidence, using groups such as “have sent mail to”, “have accepted blog/wiki comment from”.

My gut reaction from this quick experiment is that the UI space is very easily overwhelmed. I used here just a quick hand-coded list of people, in fairly ad-hoc groups (cities, current-and-former workplaces etc.). Real data has more people and more groups. Still, I think there may be something worth investigating here. The Venn tool I found was really designed for lifesci data, not people. If anyone knows of other possible software to try here, do let me know. To try this tool, simply run the Java app from the commandline, and use “File >> OpenList” on a file such as people.list.

One other thing I noticed in creating these ad-hoc groups (more or less ‘people tags’) is that representing what people have done felt intuitively as important as what they’re doing right now. For example, places people once lived or worked. This gives another axis of complexity that might need visualising. I’d like the underlying data to know who currently works/lives somewhere versus who “used to”, but in some views the two might appropriately be folded together. Tricky.

Imagemap magic

I’ve always found HTML imagemaps to be a curiously neglected technology. They seem somehow to evoke the Web of the mid-to-late 90s, to be terribly ‘1.0’. But there’s glue in the old horse yet…

A client-side HTML imagemap lets you associate links (and, via Javascript, behaviour) with regions of an image. As such, imagemaps are a form of image metadata that can have applications including image search, Web accessibility and social networking. They’re also a poor cousin to the Web’s newer vector image format, SVG. This morning I dug out some old work on this (much of which came from Max, Libby and Jim, all of whom btw are currently working at Joost; as am I, albeit part-time).

The first hurdle you hit when you want to play with HTML imagemaps is finding an editor that produces them. The fact that my blog post asking for MacOSX HTML imagemap editors is now the top Google hit for “MacOSX HTML imagemap” pretty much says it all. Eventually I found (and paid for) one called YokMak that seems OK.

So the first experiment here, was to take a picture (of me) and make a simple HTML imagemap.

danbri being imagemapped

As a step towards treating this as re-usable metadata, here’s imagemap2svg.xslt from Max back in 2002. The results of running it with xsltproc are online: _output.svg (you need an SVG-happy browser). Firefox, Safari and Opera seem more or less happy with it (ie. they show the selected area against a pink background). This shows that imagemap data can be freed from the clutches of HTML, and repurposed. You can do similar things server-side using Apache Batik, a Java SVG toolkit. There are still a few 2002 examples floating around, showing how bits of the image can be described in RDF that includes imagemap info, and then manipulated using SVG tools driven from metadata.
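
For example, here is a minimal sketch using Batik’s transcoder API to render an SVG (such as the generated _output.svg) to a PNG on the server side; the file names are just placeholders:

import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.batik.transcoder.TranscoderInput;
import org.apache.batik.transcoder.TranscoderOutput;
import org.apache.batik.transcoder.image.PNGTranscoder;

public class SvgToPng {
    public static void main(String[] args) throws Exception {
        // e.g. the _output.svg produced by imagemap2svg.xslt
        TranscoderInput input = new TranscoderInput("file:_output.svg");
        try (OutputStream out = new FileOutputStream("output.png")) {
            TranscoderOutput output = new TranscoderOutput(out);
            new PNGTranscoder().transcode(input, output);
        }
    }
}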

Once we have this ability to pick out a region of an image (eg. a photo) and tag it, it opens up a few fun directions. In the FOAF scene a few years ago, we had fun using RDF to tag image region parts with information about the things they depicted. But we didn’t really get into questions of surface syntax, ie. how to make rich claims about the image area directly within the HTML markup. These days, some combination of RDFa or microformats would probably be the thing to use (or perhaps GRDDL). I’ve sent mail to the RDFa group looking for help with this (see that message for various further related-work links too).

Specifically, I’d love to have some clean HTML markup that said, not just “this area of the photo is associated with the URI http://danbri.org/”, but “this area is the Person whose openid is danbri.org, … and this area depicts the thing that is the primary topic of http://en.wikipedia.org/wiki/Eiffel_Tower”. If we had this, I think we’d have some nice tools for finding images, for explaining images to people who can’t see them, and for connecting people and social networks through codepiction.

Codepiction

SQL, OpenOffice: would a JDBC driver for SPARQL protocol make sense?

I’d like to have access to my SPARQL data from desktop tools like OpenOffice’s spreadsheet. Apparently it can talk to remote databases via JDBC and ODBC, both for the spreadsheet and database tools within OpenOffice.

Many years ago, pre-SPARQL, I hassled Libby into making her Squish RDF query system talk JDBC. There have been related efforts since, but I’ve lost track of what’s currently feasible with modern tools. There was a SPARQL4J collaboration initially between Asemantics, Profium and HP Labs, and some JDBC code from 2005 is on sourceforge. SPARQL inside JDBC is always going to be a bit of a hack, I suspect, since SPARQL results aren’t exactly like SQL results. But if it’s the easiest way to get things into a spreadsheet, it’s probably worth pursuing. I suspect this is the sort of problem OpenLink are good at, given their database driver background.

Would it be possible to wire up OpenOffice somehow so that it thought it was talking JDBC, but the connection parameters told the Java driver about a SPARQL endpoint URL? Actually, all this is a bit predicated on the assumption that OpenOffice allows the app user to pass in SQL, so that we could pass SPARQL instead. If all the SQL is generated by OpenOffice, we’d need to hack the OO src. Am I making any sense? What still needs to be done? Where are we today?
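
To make the idea concrete, here is a sketch of what the client side might look like if such a driver existed; the jdbc:sparql: URL scheme and the driver class name are hypothetical, and the SPARQL text is simply passed where SQL would normally go:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SparqlViaJdbc {
    public static void main(String[] args) throws Exception {
        // hypothetical driver and URL scheme wrapping the SPARQL protocol over HTTP
        Class.forName("org.example.sparql.jdbc.Driver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sparql:http://dbpedia.org/sparql")) {
            String query = "SELECT ?name WHERE { ?x <http://xmlns.com/foaf/0.1/name> ?name } LIMIT 10";
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(query)) {
                while (rs.next()) {
                    // each SPARQL variable would be exposed as a result-set column
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}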

Amazon EC2: My 131 cents

Early this afternoon, I got it into my head to try out Amazon’s “Elastic Compute Cloud” (EC2) service. I’m quite impressed.

The bill so far, after some playing around, is 1.31 USD. I’ve had the default “getting started” Fedora Linux box up and running for maybe 7 hours, as well as trying out machine images (AMIs) preconfigured for Virtuoso data spaces, and for Ruby on Rails. Being familiar with neither, I’m impressed by the fact that I can rehydrate a pre-prepared Linux machine configured for these apps, simply by clicking a button in the EC2 Firefox addon. Nevertheless I managed to get bogged down with both toolkits; wasn’t in an RTFM mood, and even Rails can be fiddly if you’re me.

I’ve done nothing very compute or bandwidth intensive yet; am sure the costs could crank up a bit if I started web crawling and indexing. But it is an impressive setup, and one you can experiment with easily and cheaply, especially if you’ve already got an Amazon account. Also, being billed in USD is always cheering. The whole thing is controlled through Web service interfaces, which are hidden from me since I use either the Firefox addon, or else the commandline tools, which work fine on MacOSX once you’ve set up some paths and env variables.

I can now just type “ec2-run-instances ami-e2ca2f8b -k sandbox-keypair” or similar (the latter argument identifies a public key to install in the server), to get a new Linux machine set up within a couple of minutes, pre-configured in this case with Virtuoso Data Spaces. And since the whole environment is therefore scriptable, virtual machines can beget virtual machines. Zero-click purchase – just add money :)

So obviously I’m behind on the trends here, but hey, it’s never too late. The initial motivation for checking out EC2 was a mild frustration with the DreamHost setup I’ve been using for personal and FOAF stuff. No huge complaints, just that I was cheap and signed up for shared-box access, and it’s kind of hard not having root access after 12 years or so of having recourse to superpowers when needed. Also DreamHost don’t do Java, which is mildly annoying. Some old FOAF stuff of Libby’s is in Java, and of course it’d be good to be able to make more use of Jena.

As I type this, I’m sat across the room from an ageing linux box (with a broken CPU fan), taking up space and bringing a tangle of cables to my rather modest living room. I had been thinking about bringing it back to life as a dev box, since I’m otherwise working only on Mac and WinXP laptops. Having tried EC2, I see no reason to clutter my home with it. I’d like to see a stronger story from Amazon about backing up EC2 instances, but… well, I’d like to see a stronger story from me on that topic too :)