Exploring Linked Data with Gremlin

Gremlin is a free Java/Groovy system for traversing graphs, including but not limited to RDF. This post is based on example code from Marko Rodriguez (@twarko) and the Gremlin wiki and mailing list. The test run below goes pretty slowly when run with 4 or 5 loops, since it uses the Web as its database, via entry-by-entry fetches. In this case it’s fetching from DBpedia, but I’ve ran it with a few tweaks against Freebase happily too. The on-demand RDF is handled by the Linked Data Sail developed by Joshua Shinavier; same thing would work directly against a database too. If you like Gremlin you’ll also Joshua’s Ripple work (see screencast, code, wiki).

Why is this interesting? Don’t we already have SPARQL? And SPARQL 1.1 even has paths.  I’d like to see a bit more convergence with SPARQL, but this is a different style of dealing with graph data. The most intriguing difference from SPARQL here is the ability to drop in Turing-complete fragments throughout the ‘query'; for example in the { closure block } shown below. I’m also, for hard-to-articulate reasons, reminded somehow of Apache’s Pig language. Although Pig doesn’t allow arbitrary script, it does encourage a pipeline perspective on data processing.

So in this example we start exploring the graph from one vertex, we’ll call it ‘fry’, representing Stephen Fry’s dbpedia entry. The idea is to collect up information about actors and their co-starring patterns as recorded in Wikipedia.

Here is the full setup code needed; it can be run interactively in the Gremlin commandline console. So it runs quickly we loop only twice.

g = new LinkedDataSailGraph(new MemoryStoreSailGraph())
fry = g.v(‘http://dbpedia.org/resource/Stephen_Fry’)
g.addNamespace(‘wp’, ‘http://dbpedia.org/ontology/’)
m = [:]

From here, everything else is in one line:
fry.in(‘wp:starring’).out(‘wp:starring’).groupCount(m).loop(3){it.loops <2}

This corresponds to a series of steps (which map to TinkerPop / Blueprints / Pipes API calls behind the scenes). I’ve marked the steps in bold here:

  • in: ‘wp:starring': from our initial vertice, representing Stephen Fry, we step to vertices that point to us with a ‘wp:starring’ link
  • out: from those vertices, we follow outgoing edges marked ‘wp:starring’ (including those back to Stephen Fry), taking us to things that he and his co-stars starred in, i.e. TV shows and films.
  • we then call groupCount and pass it our bookkeeping hashtable, ‘m’. It increments a counter based on ID of current vertex or edge. As we revisit the same vertex later, the total counter for that entity goes up.
  • from this point, we then go back 3 steps, and recurse several times.  e.g. “{ it.loops < 3 }” (this last is a closure; we can drop any code in here…)

This maybe gives a flavour. See the Gremlin Wiki for the real goods. The first version of this post was verbose, as I had Gremlin step explictly into graph edges, and back into vertices each time. Gremlin allows edges to have properties, which is useful both for representing non-RDF data, but also for apps to keep annotations on RDF triples. It also exposes ‘named graph’ URIs on each edge with an ‘ng’ property. You can step from a vertex into an edge with ‘inE’, ‘outE’ and other steps; again check the wiki for details.

From an application and data perspective, the Gremlin system is interesting as it allows quantitatively minded graph explorations to be used alongside classically factual SPARQL. The results below show that it can dig out an actor’s co-stars (and then take account of their co-stars, and so on). This sort of neighbourhood exploration helps balance out the messyness of much Linked Data; rather than relying on explicitly asserted facts from the dataset, we can also add in derived data that comes from counting things expressed in dozens or hundreds of pages.

Once the Gremlin loops are finished, we can examine the state of our book-keeping object, ‘m':

Back in the gremlin.sh commandline interface (effectively typing in Groovy) we can do this…

gremlin> m2 = m.sort{ a,b -> b.value <=> a.value }

==>v[http://dbpedia.org/resource/Stephen_Fry]=58
==>v[http://dbpedia.org/resource/Hugh_Laurie]=9
==>v[http://dbpedia.org/resource/Tony_Robinson]=6
==>v[http://dbpedia.org/resource/Rowan_Atkinson]=6
==>v[http://dbpedia.org/resource/Miranda_Richardson]=4
==>v[http://dbpedia.org/resource/Tim_McInnerny]=4
==>v[http://dbpedia.org/resource/Tony_Slattery]=3
==>v[http://dbpedia.org/resource/Emma_Thompson]=3
==>v[http://dbpedia.org/resource/Robbie_Coltrane]=3
==>v[http://dbpedia.org/resource/John_Lithgow]=2
==>v[http://dbpedia.org/resource/Emily_Watson]=2
==>v[http://dbpedia.org/resource/Colin_Firth]=2
==>v[http://dbpedia.org/resource/Sandi_Toksvig]=1
==>v[http://dbpedia.org/resource/John_Sessions]=1
==>v[http://dbpedia.org/resource/Greg_Proops]=1
==>v[http://dbpedia.org/resource/Paul_Merton]=1
==>v[http://dbpedia.org/resource/Mike_McShane]=1
==>v[http://dbpedia.org/resource/Ryan_Stiles]=1
==>v[http://dbpedia.org/resource/Colin_Mochrie]=1
==>v[http://dbpedia.org/resource/Josie_Lawrence]=1
[...]

Now how would this look if we looped around a few more times? i.e. re ran our co-star traversal from each of the final vertices we settled on?
Here are the results from a longer run. The difference you see will depend upon the shape of the graph, the kind of link types you’re traversing, and so forth. And also, of course, on the nature of the things in the world that the graph describes. Here are the Gremlin results when we loop 5 times instead of 2:

==>v[http://dbpedia.org/resource/Stephen_Fry]=8160
==>v[http://dbpedia.org/resource/Hugh_Laurie]=3641
==>v[http://dbpedia.org/resource/Rowan_Atkinson]=2481
==>v[http://dbpedia.org/resource/Tony_Robinson]=2168
==>v[http://dbpedia.org/resource/Miranda_Richardson]=1791
==>v[http://dbpedia.org/resource/Tim_McInnerny]=1398
==>v[http://dbpedia.org/resource/Emma_Thompson]=1307
==>v[http://dbpedia.org/resource/Robbie_Coltrane]=1303
==>v[http://dbpedia.org/resource/Tony_Slattery]=911
==>v[http://dbpedia.org/resource/Colin_Firth]=854
==>v[http://dbpedia.org/resource/John_Lithgow]=732 [...]

Opening and closing like flowers (social platform roundupathon)

Closing some tabs…

Stephen Fry writing on ‘social network’ sites back in January (also in the Guardian):

…what an irony! For what is this much-trumpeted social networking but an escape back into that world of the closed online service of 15 or 20 years ago? Is it part of some deep human instinct that we take an organism as open and wild and free as the internet, and wish then to divide it into citadels, into closed-border republics and independent city states? The systole and diastole of history has us opening and closing like a flower: escaping our fortresses and enclosures into the open fields, and then building hedges, villages and cities in which to imprison ourselves again before repeating the process once more. The internet seems to be following this pattern.

How does this help us predict the Next Big Thing? That’s what everyone wants to know, if only because they want to make heaps of money from it. In 1999 Douglas Adams said: “Computer people are the last to guess what’s coming next. I mean, come on, they’re so astonished by the fact that the year 1999 is going to be followed by the year 2000 that it’s costing us billions to prepare for it.”

But let the rise of social networking alert you to the possibility that, even in the futuristic world of the net, the next big thing might just be a return to a made-over old thing.

McSweenys:

Dear Mr. Zuckerberg,

After checking many of the profiles on your website, I feel it is my duty to inform you that there are some serious errors present. [...]

Lest-we-forget. AOL search log privacy goofup from 2006:

No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men” to “dog that urinates on everything.”

And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.”

It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs. “Those are my searches,” she said, after a reporter read part of the list to her.

Time magazine punditising on iGoogle, Facebook and OpenSocial:

Google, which makes its money on a free and open Web, was not happy with the Facebook platform. That’s because what happens on Facebook stays on Facebook. Google would much prefer that you come out and play on its platform — the wide-open Web. Don’t stay behind Facebook’s closed doors! Hie thee to the Web and start searching for things. That’s how Google makes its money.

So, last fall, Google rallied all the other major social networks (MySpace, Bebo, Hi5 and so on) and announced a new initiative called OpenSocial. OpenSocial wants to be like Facebook’s platform, only much bigger: Widget makers can write applications for it and they can run anywhere — on MySpace, Bebo and Google’s own social network, Orkut, which is very big in Brazil.

Google’s platform could actually dwarf Facebook — if it ever gets off the ground.

Meanwhile on the widget and webapp security front, we have “BBC exposes Facebook flaw” (information about your buddies is accessible to apps you install; information about you is accessible to apps they install). Also see Thomas Roessler’s comments to my Nokiana post for links to a couple of great presentations he made on widget security. This includes a big oopsie with the Google Mail widget for MacOSX. Over in Ars Technica we learn that KDE 4.1 alpha 1 now has improved widget powers, including “preliminary support for SuperKaramba and Mac OS X Dashboard widgets“. Wonder if I can read my Gmail there…

As Stephen Fry says,  these things are “opening and closing like a flower”. The big hosted social sites have a certain oversimplifying retardedness about them. But the ability for code to go visit data (the widget/gadget model), is I think as valid as the opendata model where data flows around to visit code. I am optimistic that good things will come out of this ferment.

A few weeks ago I had the pleasure of meeting several of the Google OpenSocial crew in London. They took my grumbling about accessibility issues pretty well, and I hope to continue that conversation. Industry politics and punditry aside, I’m impressed with their professionalism and with the tie-in to an opensource implementation through Apache’s ShinDig project. The OpenSocial specs list is open to the public, where Cassie has just announced that “all 0.8 opensocial and gadgets spec changes have been resolved” (after a heroic slog through the issue list). I’m barely tracking the detail of discussion there, things are moving fast. There’s now a proposed REST API, for example; and I learned in London about plans for a formatting/templating system, which might be one mechanism for getting FOAF/RDF out of OpenSocial containers.

If OpenSocial continues to grow and gather opensource mindshare, it’s possible Facebook will throw some chunks of their platform over the wall (ie. “do an Adobe“). And it’ll probably be left to W3C to clean up the ensuring mess and fragmentation, but I guess that’s what they’re there for. Meanwhile there’s plenty yet to be figured out, … I think we’re in a pre-standards experimentation phase, regardless of how stable or mature we’re told these platforms are.

The fundamental tension here is that we want open data, open platforms, … for data and code to flow freely, but to protect the privacy, lives and blushes of those it describes. A tricky balance. Don’t let anyone tell you it’s easy, that we’ve got it figured out, or that all we need to do is “tear down the walls”.

Opening and closing like flowers…