Exploring Linked Data with Gremlin

Gremlin is a free Java/Groovy system for traversing graphs, including but not limited to RDF. This post is based on example code from Marko Rodriguez (@twarko) and the Gremlin wiki and mailing list. The test run below goes pretty slowly when run with 4 or 5 loops, since it uses the Web as its database, via entry-by-entry fetches. In this case it’s fetching from DBpedia, but I’ve ran it with a few tweaks against Freebase happily too. The on-demand RDF is handled by the Linked Data Sail developed by Joshua Shinavier; same thing would work directly against a database too. If you like Gremlin you’ll also Joshua’s Ripple work (see screencast, code, wiki).

Why is this interesting? Don’t we already have SPARQL? And SPARQL 1.1 even has paths.  I’d like to see a bit more convergence with SPARQL, but this is a different style of dealing with graph data. The most intriguing difference from SPARQL here is the ability to drop in Turing-complete fragments throughout the ‘query'; for example in the { closure block } shown below. I’m also, for hard-to-articulate reasons, reminded somehow of Apache’s Pig language. Although Pig doesn’t allow arbitrary script, it does encourage a pipeline perspective on data processing.

So in this example we start exploring the graph from one vertex, we’ll call it ‘fry’, representing Stephen Fry’s dbpedia entry. The idea is to collect up information about actors and their co-starring patterns as recorded in Wikipedia.

Here is the full setup code needed; it can be run interactively in the Gremlin commandline console. So it runs quickly we loop only twice.

g = new LinkedDataSailGraph(new MemoryStoreSailGraph())
fry = g.v(‘http://dbpedia.org/resource/Stephen_Fry’)
g.addNamespace(‘wp’, ‘http://dbpedia.org/ontology/’)
m = [:]

From here, everything else is in one line:
fry.in(‘wp:starring’).out(‘wp:starring’).groupCount(m).loop(3){it.loops <2}

This corresponds to a series of steps (which map to TinkerPop / Blueprints / Pipes API calls behind the scenes). I’ve marked the steps in bold here:

  • in: ‘wp:starring': from our initial vertice, representing Stephen Fry, we step to vertices that point to us with a ‘wp:starring’ link
  • out: from those vertices, we follow outgoing edges marked ‘wp:starring’ (including those back to Stephen Fry), taking us to things that he and his co-stars starred in, i.e. TV shows and films.
  • we then call groupCount and pass it our bookkeeping hashtable, ‘m’. It increments a counter based on ID of current vertex or edge. As we revisit the same vertex later, the total counter for that entity goes up.
  • from this point, we then go back 3 steps, and recurse several times.  e.g. “{ it.loops < 3 }” (this last is a closure; we can drop any code in here…)

This maybe gives a flavour. See the Gremlin Wiki for the real goods. The first version of this post was verbose, as I had Gremlin step explictly into graph edges, and back into vertices each time. Gremlin allows edges to have properties, which is useful both for representing non-RDF data, but also for apps to keep annotations on RDF triples. It also exposes ‘named graph’ URIs on each edge with an ‘ng’ property. You can step from a vertex into an edge with ‘inE’, ‘outE’ and other steps; again check the wiki for details.

From an application and data perspective, the Gremlin system is interesting as it allows quantitatively minded graph explorations to be used alongside classically factual SPARQL. The results below show that it can dig out an actor’s co-stars (and then take account of their co-stars, and so on). This sort of neighbourhood exploration helps balance out the messyness of much Linked Data; rather than relying on explicitly asserted facts from the dataset, we can also add in derived data that comes from counting things expressed in dozens or hundreds of pages.

Once the Gremlin loops are finished, we can examine the state of our book-keeping object, ‘m':

Back in the gremlin.sh commandline interface (effectively typing in Groovy) we can do this…

gremlin> m2 = m.sort{ a,b -> b.value <=> a.value }


Now how would this look if we looped around a few more times? i.e. re ran our co-star traversal from each of the final vertices we settled on?
Here are the results from a longer run. The difference you see will depend upon the shape of the graph, the kind of link types you’re traversing, and so forth. And also, of course, on the nature of the things in the world that the graph describes. Here are the Gremlin results when we loop 5 times instead of 2:

==>v[http://dbpedia.org/resource/John_Lithgow]=732 [...]

Map-reduce-merge and Hadoop/Hbase RDF

 Just found this interesting presentation,

Map-Reduce-Merge:  Simpli?ed Relational  Data Processing on  Large Clusters
by Hung-chih Yang, Ali Dasdan Ruey-Lung Hsiao, D. Stott Parker; as presented by Nate Rober  (PDF)


Extending MapReduce
1. Change to reduce phase
2. Merge phase
3. Additional user-de?nable operations
a. partition selector
b. processor
c. merger
d. con?gurable iterators

Implementing Relational Algebra Operations
1. Projection
2. Aggregation
3. Selection
4. Set Operations: Union, Intersection, Difference
5. Cartesian Product
6. Rename
7. Join

[for more detail see full slides]

MapReduce & GFS represent a paradigm shift in data processing: use a simpli?ed interface instead of overly general DBMS.
Map-Reduce-Merge adds the ability to execute arbitrary relational algebra queries.
Next steps: develop SQL-like interface and  a query optimizer.

Research paper: Map-reduce-merge: simplified relational data processing on large clusters (PDF for ACM people)

Linked from HRDF page in the Hadoop wiki, where there appears to be a proposal brewing to build an RDF store on top of the Hadoop/Hbase infrastructure.

Nearby: LargeTripleStores in ESW wiki

Not entirely unrelated: Google Social Graph API  (which parsers FOAF/RDF from ‘The Web’ but discards all but the social graph parts currently)