SPARQL results in spreadsheets

I made a little progress with SPARQL and spreadsheets. On Kingsley Idehen’s advice, I revisited OpenOffice (NeoOffice on MacOSX) and used the HTML table import utility.

It turns out that if I have an HTTP URL for a GET query to ARC’s SPARQL endpoint, and I pass in the non-standard format=htmltab parameter, I get an HTML table which can directly be understood by NeoOffice. I’ll write this up properly another time, but in brief, I hacked the HTML form generator for the endpoint to use GET instead of POST, making it easier to get URLs associated with SPARQL queries. Then in NeoOffice, under “Insert” / “Link to External Data…”, paste in that URL, hit return, select the appropriate HTML table … and the data should import.
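For reference, here’s a rough sketch of building such a URL in PHP. The endpoint address is made up, the query parameter name is the standard SPARQL-protocol one, and format=htmltab is ARC’s own non-standard switch for HTML table output.

#!/usr/bin/php
<?php
# Sketch: build a GET URL for an ARC SPARQL endpoint that returns an HTML table.
# The endpoint address below is a placeholder; format=htmltab is ARC-specific.
$endpoint = 'http://example.org/sparql.php';
$query = 'SELECT DISTINCT ?g ?property WHERE { GRAPH ?g { ?thing ?property ?value . } } ORDER BY ?property';
$url = $endpoint . '?query=' . urlencode($query) . '&format=htmltab';
echo $url . "\n"; # paste this into NeoOffice's "Insert" / "Link to External Data..." dialog
?>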

Here are some fragments of FOAF and related info in NeoOffice (from a query that asked for the data source URI, a name and a homepage URI):

Open your data?

I tried this on a version of Excel for MacOSX (an old one, 2001), but failed to get HTML import working. Instead I just copied the info over from NeoOffice, since I wanted to try learning Excel’s Pivot and charting tools. Here is a quick example derived from an Excel analysis of the RDF properties that appear in the different graphs in my SPARQL store. The imported data was just two columns: a GRAPH URI, ?g, and a property URI, ?property:


PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?g ?property
WHERE {
  GRAPH ?g { ?thing ?property ?value . }
}
ORDER BY ?property

Excel (and NeoOffice) offer Pivot and charting tools (similar to a simplified version of what you’ll find under the Business Intelligence banner elsewhere) that take simple flat tables like this and (as they say) slice, dice, count and summarise. Here’s an unreadable chart I managed to make it generate:

SPARQL stats
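(For anyone who’d rather do the counting in code than in a spreadsheet, here’s a rough sketch of the same per-graph tally, run directly against the ARC store with the query above. The store config matches the loader script later in this post; the result-array layout is ARC’s as far as I recall, so double-check it against the ARC docs.)

#!/usr/bin/php
<?php
# Sketch: count distinct properties per graph straight from the ARC store,
# instead of pivoting the exported table in Excel/NeoOffice.
include_once("../arc/ARC2.php");
$config = array('db_host' => 'localhost', 'db_name' => 'sg1', 'db_user' => 'sparql',
  'db_pwd' => '123rememberme', 'store_name' => 'crawl');
$store = ARC2::getStore($config);

$q = 'SELECT DISTINCT ?g ?property WHERE { GRAPH ?g { ?thing ?property ?value . } } ORDER BY ?property';
$rs = $store->query($q);
$counts = array();
foreach ($rs['result']['rows'] as $row) { # ARC puts SELECT bindings under result/rows
  $counts[$row['g']] = isset($counts[$row['g']]) ? $counts[$row['g']] + 1 : 1;
}
foreach ($counts as $graph => $n) {
  echo $graph . "\t" . $n . " distinct properties\n";
}
?>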

Am still learning the ropes here… but I like the idea of combining desktop data analysis and reports with an ARC SPARQL store attached to my blog, and populated with data crawled from my web2ish ‘network neighbourhood’ (ie. FOAF, RSS, XFN, SIOC etc). Work in spare-time progress.

Commandline PHP for loading RDF URLs into ARC (and Twinkle for query UI)


#!/usr/bin/php
<?php
if ($argc != 2 || in_array($argv[1], array('--help', '-help', '-h', '-?'))) {
?>
This is a command line PHP script with one option: the URL of an RDF document to load
<?php
} else {

  $supersecret = "123rememberme"; # Security analysts recommend using date of birth + social security ID here
  # *** be careful with real MySQL passwords ***

  include_once("../arc/ARC2.php");
  $config = array(
    'db_host' => 'localhost',
    'db_name' => 'sg1',
    'db_user' => 'sparql',
    'db_pwd' => $supersecret,
    'store_name' => 'crawl',
  );
  $store = ARC2::getStore($config);
  if (!$store->isSetUp()) { $store->setUp(); }

  $profile = $argv[1];
  echo "Loading data from " . $profile . "\n";
  $store->query('DELETE FROM <' . $profile . '>'); # clear the named graph before re-loading
  $store->query('LOAD <' . $profile . '>');        # fetch the URL and store its triples under that graph URI
}
?>

FWIW, this is what I’m using to (re)load data into an ARC store from the commandline. I’ll try wiring up my old RDF crawler to this when I get time. Each loaded source is stored as a named graph, with the URI it is loaded from being the named graph URI. ARC just needs the path to the unpacked PHP libraries, and connection details for a MySQL database, and comes with a handy SPARQL endpoint script too, which I’ve been testing with Twinkle.
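For completeness, the endpoint side is similarly short. This is a sketch rather than my exact endpoint script, but it follows the pattern from ARC’s docs (ARC2::getStoreEndpoint plus a few endpoint_* settings; check the option names against the current ARC documentation before relying on them):

<?php
# Sketch of an ARC SPARQL endpoint script (not my exact setup); same store config as the loader above.
include_once("../arc/ARC2.php");
$config = array(
  'db_host' => 'localhost', 'db_name' => 'sg1', 'db_user' => 'sparql',
  'db_pwd' => '123rememberme', 'store_name' => 'crawl',
  'endpoint_features' => array('select', 'construct', 'ask', 'describe'), # read-only queries
  'endpoint_read_key' => '',   # no key needed for reads
  'endpoint_max_limit' => 250, # cap result sizes
);
$ep = ARC2::getStoreEndpoint($config);
if (!$ep->isSetUp()) { $ep->setUp(); }
$ep->go(); # handles the HTTP request; offers HTML, XML, JSON etc. result formats
?>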

My public sandbox data is currently loaded up as follows. No promises it’ll stay there, but anyway, adding the following to Twinkle 2.0’s config file section for SPARQL endpoints works for me. The endpoint also offers a basic Web interface directly, with HTML, XML, JSON etc. output formats.


<http://sandbox.foaf-project.org/2008/foaf/ggg.php>
    a sources:Endpoint ; rdfs:label "FOAF Social Graph Aggregator sandbox" .

Shadows on the Web

This is a quick note, inspired by the recent burst of posts passing through Planet RDF about RDF, WebArch and a second “shadow” Web. Actually it’s not about that thread at all, except to note that Ian Davis asks just the right kind of questions when thinking about the WebArch claim that the Web ships with a hardcoded, timeless and built-in ontology, carving up the Universe between “information resources” and “non-information resources”. Various Talis folk are heading towards Bristol this week, so I expect we’ll pick up this theme offline again shortly! (Various other Talis folk – I’m happy to be counted as a Talis person, even if they choose the worst picture of me for their blog :).

Anyhow, I made my peace with the TAG’s http-range-14 resolution long ago, and prefer the status quo to a situation where “/”-terminated namespaces are treated as risky and broken (as was the case pre-2005). But RDFa brings the “#” WebArch mess back to the forefront, since RDF and HTML can be blended within the same environment. Perhaps – reluctantly – we do need to revisit this perma-thread one last time. But not today! All I wanted to write about right now is the “shadow” metaphor. It crops up in Ian’s posts, and he cites Rob McCool’s writings. Since I’m unequipped with an IEEE login, I’m unsure where the metaphor originated. Ian’s usage is in terms of RDF creating a redundant and secondary structure that ordinary Web users don’t engage with, ie. a “shadow of the real thing”. I’m not going to pursue that point here, except to say I have sympathies, but am not too worried. GRDDL, RDFa etc help.

Instead, I’m going to suggest that we recycle the metaphor, since (when turned on its head) it gives an interesting metaphorical account of what the SW is all about. Like all 1-line metaphorical explanations of complex systems, the real value comes in picking it apart, and seeing where it doesn’t quite hold:

“Web documents are the shadows cast on the Web by things in the world.”

What do I mean here? Perhaps this is just pretentious, I’m not sure :) Let’s go back to the beginnings of the SW effort to picture this. In 1994 TimBL gave a Plenary talk at the first International WWW Conference. Amongst other things, he announced the creation of W3C, and described the task ahead of us in the Semantic Web community. This was two years before we had PICS, and three years before the first RDF drafts. People were all excited about Mosaic, to help date this. But even then, the description was clear, and well illustrated in a series of cartoon diagrams, eg:

TimBL 1994 Web semantics diagram

I’ve always liked these diagrams, and the words that went with them.

So much so that when I had the luxury of my own EU project to play with, they got reworked for us by Liz Turner: we made postcards and t-shirts, which Libby delighted in sending to countless semwebbers to say “thanks!”. Here’s the postcard version:

SWAD-Europe postcard

The basic idea is just that Web documents are not intrinsically interesting things; what’s interesting, generally, is what they’re about. Web users don’t generally care much about sequences of unicode characters or bytes; we care about what they mean in our real lives. The objects they’re about, the agreements they describe, the real-world relationships and claims they capture. There is a Web of relationships in the world, describable in countless ways by countless people, and the information we put into the Web is just a pale shadow of that.

The Web according to TimBL, back in ’94:

To a computer, then, the web is a flat, boring world devoid of meaning. This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them. For example, a document might describe a person. The title document to a house describes a house and also the ownership relation with a person. [...]

On this thinking, Web documents are the secondary thing; the shadow. What matters is the world and its mapping into digital documents, rather than the digital stuff alone. The shadow metaphor breaks down a little, if you think of the light source as something like the Sun, ie. with each real-world entity shadowed by a single (authoritative?) document in the Web. Life’s not like that, and the Web’s not like that either. Objects and relationships in the real world show up in numerous ways on the Web; or sometimes (thankfully) not at all. If Web documents can be thought of as shadows, they’re shadows cast in many lights, many colours, and by multiple independent light sources. Some indistinct, soft and flattering; occasionally frustratingly vague. Others bright, harshly precise and rigorously scientific (and correspondingly expensive and experty to use). But the core story is that it’s the same shared world that we’re seeing in all these different lights, and that the Web and the world are both richer because life is illustrated from multiple perspectives, and because the results can be visible to all.

The Semantic Web is, on this story, not a shadow of the real Web, but a story about how the Web is a shadow of the world. The Semantic Web is, in fact, much more like a 1970s disco than a shadow…

HTML Imagemap authoring tool for MacOSX?


I’ve searched around in vain for one. I want to annotate my FOAF spec diagram with mouseover text and links into the documentation. Most of the tools I find are a decade or more old, or pay-to-play. I remember the Gimp image editor can do imagemaps, but it crashes on startup on my MacBook. I did find a nice Javascript editor the other day, but it had no “undo” function, which made complex work impossible. Maybe I’ll try Amaya. Suggestions very much welcomed! Are imagemaps *that* uncool these days? They’re just SVG and image metadata in another notation… (and imho one of the more interesting scenarios for HTML-based microformattery). That last link has missing photos and out of date SVG, but might still be of interest. A surviving screenshot included here.

Nearby: (from SWAD-Europe hacking days) an imagemap2svg.xslt thanks to Max Froumentin.

Update: I installed Amaya 9.99 for Intel MacOSX. Sadly it couldn’t even display my JPEG properly, although it did do a better job showing OmniGraffle’s SVG output than Firefox.


cvs2svn migration tips

Some advice from Garrett Rooney on Subversion. I was asking about moving the historical records for the FOAF project (xmlns.com etc) from dusty old CVS into shiny new Subversion. In particular, from CVS where I own and control the box, to Subversion where I’m a non-root normal mortal user (ie. Dreamhost customer).

The document records are pretty simple (no branches etc.). In fact only the previous versions of the FOAF RDF namespace document are of any historical interest at all. But I wanted to make sure that I could get things out again easily, without owning the Subversion repository or begging the Dreamhost sysadmins.

Answer: “…a slightly qualified yes. assuming the subversion server is running svn 1.4 or newer a non-root user can use the svnsync tool to copy the contents (i.e. entire history they have read access to) of one repository into a new one. with servers before 1.4 it’s still possible to extract the information, but it’s more difficult.”

And finding the version number? “if they’re letting you access svn via http you can just browse a repository and by default it’ll tell you the server version in the html view of the repository“.

Easily done, we’re on 1.4.2. That’s good.

Q: Any recommendations for importing from CVS?
A: “Converting from cvs to svn is a hellish job, due to the amount of really necessary data that cvs doesn’t record. cvs2svn has to infer a lot of it, and it does a damn good job of it considering what a pain in the ass it is. I’m immediately skeptical of anyone who goes out and writes another conversion tool ;-)”

“If you don’t have the ability to load a dumpfile into the repository you can load it into a local repos and then use svnsync to copy that into a remote empty repository. svnsync unfortunately won’t let you load into a non-empty repository, if you want to do that you need to use a svnadmin load, which requires direct access to the repository. most hosting sites will give you some way to do that sort of thing though.”

Thanks Garrett!

OpenID plugin for WordPress

I’ve just installed Alan J Castonguay’s WordPress OpenID plugin on my blog, part of a cleanup that included nuking 11000+ comments in the moderation queue using the Spam Karma 2 plugin. Apologies if I zapped any real comments too. There are a few left, at least!

The OpenID thing appears to “just work”. By which I mean, I could log in via it and leave a comment. I’d be super-grateful if those of you with OpenIDs could take a minute to leave a comment on this post, to see if it works as well as it seems to. If it doesn’t, a bug report (to danbrickley@gmail.com) would be much appreciated. Those of you with LiveJournals or AOL/AIM accounts already have OpenID, even if you didn’t notice. See the HTML source for my homepage to see how I use “danbri.org” as an OpenID while delegating the hard work to LiveJournal. For more on OpenID, check out these tutorial slides (flash/pdf) from Simon Willison and David Recordon.

Thinking about OpenID-mediated blog comments, the tempting thing then would be to do something with the accumulated URIs. The plugin keeps its data in nice SQL tables, which are presumably accessible to other WordPress plugins. It’s been a while since I made a WordPress plugin, but plugins seem to have a pretty good framework available to them now.

mysql> select user_id, url from wp_openid_identities;
+---------+--------------------+
| user_id | url                |
+---------+--------------------+
|      46 | http://danbri.org/ |
+---------+--------------------+
1 row in set (0.28 sec)

At the moment, it’s just me. It’d be fun to try scooping up RDF (FOAF, SKOS, SIOC, feeds…) from any OpenID URIs that accumulate there. Hmm I even wrote up that project idea a while back – SparqlPress. At the time I tried prototyping it in Redland + PHP, but nowadays I’d probably use Benjamin Nowack’s ARC library, which provides SPARQL query of a MySQL-backed RDF store, and is written in PHP. This gives it the same dependencies as WordPress, making it ideal for pluginization. If anyone’s looking for a modest-sized practical SemWeb project to hack on, that one could be a lot of fun.
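To make that concrete, here’s a rough sketch of what the scooping-up step might look like inside a WordPress plugin. None of this is real SparqlPress code: the table name is the one shown above, the database constants and $wpdb calls are standard WordPress, and the ARC calls mirror the command line loader from the earlier post.

<?php
# Hypothetical sketch: feed commenters' OpenID URLs from WordPress into an ARC store.
# Assumes ARC2 unpacked alongside the plugin and the wp_openid_identities table shown above.
include_once(dirname(__FILE__) . '/arc/ARC2.php');

function sparqlpress_crawl_openids() {
  global $wpdb;
  $config = array(
    'db_host' => DB_HOST, 'db_name' => DB_NAME,
    'db_user' => DB_USER, 'db_pwd' => DB_PASSWORD,
    'store_name' => 'crawl',
  );
  $store = ARC2::getStore($config);
  if (!$store->isSetUp()) { $store->setUp(); }

  # One named graph per OpenID URL, mirroring the command line loader above.
  $urls = $wpdb->get_col("SELECT url FROM {$wpdb->prefix}openid_identities");
  foreach ($urls as $url) {
    $store->query('DELETE FROM <' . $url . '>');
    $store->query('LOAD <' . $url . '>');
  }
}
?>

(In a real plugin this would be wired to a scheduled hook or an admin button rather than run inline.)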

There’s a lot of interesting and creative fuss about “social networking” site interop around lately, largely thanks to the social graph paper from Brad Fitzpatrick and David Recordon. I lean towards the “show me, don’t tell me” approach regarding buddylists and suchlike (as does Julian Bond with Ecademy), which is why FOAF has only ever had the mild-mannered “knows” relationship in the core vocabulary, rather than trying to over-formalise “bestest friend EVER” and other teenisms. So what I like about this WordPress plugin is that it gives some evidence-based raw material for decentralised social networking apps. Blog comments don’t tell the whole story; nothing tells the whole story. But rather than maintain a FOAF “knows” list (or blogroll, or blog-reader config) by hand, I’d prefer to be able to partially automate it by querying information about whose blogs I’ve commented on, and vice-versa. There’s a lot that could be built, intimidatingly much, and it’s hard to know where to start. I suggest that everyone in the SemWeb scene having an OpenID, with a FOAF file linked from it, would make an interesting platform from which to start exploring…

Meanwhile, I’ll try generating an RDF blogroll from any URIs that show up in my OpenID WordPress table, so I can generate a planetplanet or chumpologica configuration automatically…
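A first cut at that blogroll could be as small as the sketch below, which just dumps the table as FOAF in Turtle. The database credentials are placeholders, and treating each OpenID URL as a homepage is a simplification; a real version would do feed autodiscovery before handing anything to planetplanet or chumpologica.

#!/usr/bin/php
<?php
# Hypothetical sketch: emit a minimal FOAF blogroll in Turtle from the OpenID table.
# Credentials below are placeholders; no feed autodiscovery is attempted here.
$db = new mysqli('localhost', 'wordpress', 'secret', 'wordpress');
$result = $db->query("SELECT url FROM wp_openid_identities");

echo "@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n\n";
while ($row = $result->fetch_assoc()) {
  # Treat the commenter's OpenID URL as both their OpenID and (crudely) their homepage.
  echo "[] a foaf:Person ;\n";
  echo "   foaf:openid <" . $row['url'] . "> ;\n";
  echo "   foaf:homepage <" . $row['url'] . "> .\n\n";
}
?>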

Who, what, where, when?

A “Who? what? where? when?” of the Semantic Web is taking shape nicely.

Danny Ayers shows some work with FOAF and the hCard microformat, picking up a theme first explored by Dan Connolly back in 2000: inter-conversion between RDF and HTML person descriptions. Danny generates hCards from SPARQL queries of FOAF, an approach which would pair nicely with GRDDL for going in the other direction.

Meanwhile at W3C, the closing days of the SW Best Practices Group have recently produced a draft of an RDF/OWL Representation of Wordnet. Wordnet is a fantastic resource, containing descriptions of pretty much every word in the English language. Anyone who has spent time in committees, deciding which terms to include in some schema/vocabulary, must surely feel the appeal of a schema which simply contains all possible words. There are essentially two approaches to putting Wordnet into the Semantic Web. A lexically-oriented approach, such as the one published at W3C for Wordnet 2.0, presents a description of words. It mirrors the structure of Wordnet itself (verbs, nouns, synsets etc.). Consequently it can be a complete and non-judgemental reflection into RDF of all the work done by the Wordnet team.

The alternate, and complementary, approach is to explore ways of projecting the contents of Wordnet into an ontology, so that category terms (from the noun hierarchy) in Wordnet become classes in RDF. I made a simplistic attempt at this some time back (see overview). It has appeal (alongside the linguistic version) because it allows RDF to be used to describe instances of classes for each concept in wordnet, with other properties of those instances. See WhyWordnetIsCool in the FOAF wiki for an example of Wordnet’s coverage of everyday objects.

So, getting Wordnet moving into the SW is a great step. It gives us URIs to identify a huge number of everyday concepts. Its coverage isn’t complete, and its ontology is kinda quirky. Aldo Gangemi and others have worked on tidying up the hierarchy; I believe only for version 1.6 of Wordnet so far. I hope that work will eventually get published at W3C or elsewhere as stable URIs we can all use.

In addition to Wordnet there are various other efforts that give types that can be used for the “what” of “who/what/where/when”. I’ve been talking with Rob McCool about re-publishing a version of the old TAP knowledge base. The TAP project is now closed, with Rob working for Yahoo and Guha at Google. Stanford maintain the site but aren’t working on it. So I’ve been working on a quick cleanup (well-formed RDF/XML etc.) of TAP that could get it into more mainstream use. TAP, unlike Wordnet, has more modern everyday commercial concepts (have a look), as well as a lot of specific named instances of these classes.

Which brings me to (Semantic) Wikipedia; another approach to identifying things and their types on the Semantic Web. A while back we added isPrimaryTopicOf to FOAF, to make it easier to piggyback on Wikipedia for RDF-identifying things that have Wiki (and other) pages about them. The Semantic Mediawiki project goes much much further in this direction, providing a rich mapping (classes etc.) into RDF for much of Wikipedia’s more data-oriented content. Very exciting, especially if it gets into the main codebase.

So I think the combination of things like Wordnet, TAP, Wikipedia, and instance-identifying strategies such as “isPrimaryTopicOf”, will give us a solid base for identifying what the things are that we’re describing in the Semantic Web.

And regarding “Where?” and “when?”… on the UI front, we saw a couple of announcements recently: OpenLayers v1.0, which provides Google-maps-like UI functionality, but opensource and standards friendly. And for ‘when’, a similar offering: the timeline widget. This should allow for fancy UIs to be wired in with RDF calendar or RDF-geo tagged data.

Talking of which… good news of the week: W3C has just announced a Geo incubator group (see detailed charter), whose mission includes updates for the basic Geo (ie. lat/long etc) vocabulary we created in the SW Interest Group.

Ok, I’ve gone on at enough length already, so I’ll talk about SKOS another time. In brief – it fits in here in a few places. When extended with more lexical stuff (for describing terms, eg. multi-lingual thesauri) it could be used as a base for representing the lexically-oriented version of Wordnet. And it also fits in nicely with Wikipedia, I believe.

Last thing, don’t get me wrong — I’m not claiming these vocabs and datasets and bits of UI are “the” way to do ‘who/what/where/when’ on the Semantic Web. They’re each one of several options. But it is great to have a whole pile of viable options at last :)

FOAF vocabulary management details

I have an action from the Vocabulary Management task force of the W3C SW Best Practices Working Group to document the way the FOAF namespace currently works. Here goes.

The FOAF RDF vocabulary is named by the URI http://xmlns.com/foaf/0.1/ and has information available at that URI in both machine-friendly and human-friendly form. When that URI is dereferenced by HTTP, the default representation is currently HTML-based.

Aside: I say “based” because the markup, while basically being XHTML, also includes inline RDF/XML describing the FOAF vocabulary. The HTML TF of SWBPD WG is looking at ways of making such documents validatable; I hope a future version will be valid in some way, as well as presentable in mainstream browsers. This might be via RDF/A in XHTML2, RDF/A back-ported to an early XHTML variant, a GRDDL-based transform, or perhaps a CDF document if they’re deployable in legacy browsers.

Backing up to the big picture: we want to make it easy for both people and machines to find out what they need. So the basic idea is that there’s an HTML document describing FOAF for humans, and a set of RDF statements describing FOAF for machines. The RDF statements are available in several ways. As embedded RDF/XML in the HTML page. As a separate index.rdf document (this should be LINK REL’d from the former), and as a content-negotiable representation of the main URI, available to clients that send an “Accept: application/rdf+xml” header.

In addition to this, each FOAF term (assuming the Apache .htaccess is up to date; this might not be currently true) is redirected by the Web server to the main FOAF namespace URI. For example, http://xmlns.com/foaf/0.1/Person. This should in the future be done with a 303 HTTP redirection code, to be consistent with the recent W3C TAG decision on http-range-14 (apologies for the jargon and lack of links to explanation). Because FOAF term URIs don’t contain a # [edit: I wrote / previously; thanks mortenf], note that the redirection can’t point down into a sub-section of the HTML document. To make the HTML document more usable, we should probably put a better table of contents for terms nearer to the top of the page, to allow someone to navigate quickly to the description of that term.
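For illustration only: the real setup does this with Apache .htaccess rules rather than PHP, but here’s a rough sketch of the equivalent logic, with a 303 redirect for term URIs and an Accept-header check on the namespace URI. The index.rdf filename is the one mentioned above; index.html and the ‘term’ parameter are made up for the sketch.

<?php
# Illustrative sketch only; xmlns.com actually handles this with Apache .htaccess rules.
$ns = 'http://xmlns.com/foaf/0.1/';

# 1. Term URIs (e.g. .../Person) get a 303 See Other back to the namespace document.
if (isset($_GET['term']) && $_GET['term'] !== '') {
  header('HTTP/1.1 303 See Other');
  header('Location: ' . $ns);
  exit;
}

# 2. The namespace URI itself is content-negotiated: RDF/XML for clients that ask
#    for it via the Accept header, the XHTML spec document for everyone else.
$accept = isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '';
if (strpos($accept, 'application/rdf+xml') !== false) {
  header('Content-Type: application/rdf+xml');
  readfile('index.rdf');  # the machine-readable description mentioned above
} else {
  header('Content-Type: text/html; charset=utf-8');
  readfile('index.html'); # hypothetical filename for the human-readable spec
}
?>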

One last point: each term defined in FOAF is accompanied not only by a label and comment (standard from RDFS); it also has a chunk of HTML markup, with cross-references to other terms. At the moment, this is not made available in any machine-readable form. There might be some scope here for common practice with SKOS and other vocabularies?

Sorry for the hasty writeup, I just wanted to get this recorded as a starting point…