CheckRDFSyntax and Schemarama Revisited

So I meant to write about a 1-line piece of Javascript, but ended up with a 5000-word freeform essay on the nature of RDF, XML, validation and so forth. It could probably do with some editing, but for now the words are in pretty much the order they came out of my brain. A short summary: thinking about our expectations of RDF “validation” can teach us a lot about RDF’s value, about its relationship to XML, and about the things we should focus on building next.

I’ve just made a Javascript “favelet” for checking documents against the RDF/XML syntax. It uses the W3C RDF Validator, which in turn uses the ARP RDF parser from Jena.

The Javascript: CheckRDFSyntax

I have made some assumptions about the options passed to the validator; these are easily adjusted by looking at the source of the normal submission page. It currently asks for ‘N-Triples’ syntax back (this is more compact than the default tabular view), and also has the ‘rdf:RDF is omitted’ flag set, which is useful for checking documents that indicate their RDF-ness in some other way (eg. DOAP files with a content-type of ‘application/rdf+xml’).

The favelet (aka ‘bookmarklet’) exploits the ability to send ‘GET’ requests to the RDF validator. The current main form uses ‘POST’; it’s possible that GETs might be disallowed in the future, eg. for server-load issues. I have set the Javascript to not ask for images; it is probably best to leave things that way, since to do otherwise could overload this (free and useful) service.
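The GET-request idea can be sketched in a few lines of Python. This is only an illustration of the URL-building step, not the favelet itself: the endpoint, the `URI` parameter name and the option names below are placeholders rather than the validator's real interface, which should be read off the source of its submission page.

```python
from urllib.parse import urlencode

def build_check_url(validator_endpoint, document_url, options):
    """Return a GET URL asking a validator service to check document_url."""
    # "URI" is an assumed parameter name, not taken from the real service.
    params = {"URI": document_url}
    params.update(options)
    return validator_endpoint + "?" + urlencode(params)

# Placeholder endpoint and option names, for illustration only.
url = build_check_url(
    "https://validator.example/check",
    "http://example.org/doap.rdf",
    {"output": "ntriples", "images": "none"},
)
print(url)
```

A bookmarklet does the same thing in the browser, substituting the address of the current page for the document URL.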

Some more Favelets for the W3C Markup Validation Service are also available. If people find this one useful, I’ll see about getting it linked from that page and from the RDF validator itself.

Why call it “Check RDF syntax” instead of “validate RDF”, you might ask?

There’s a long answer and a short answer. I intended to keep this short, but failed. Basically, the word “validate” is terribly overloaded, particularly when we compare what it means in XML with the world of RDF. RDF users all too frequently ask for the ability to “validate” a document against an RDF schema, when what they often want is one of several things. Explaining the difference is not the easiest of things to do…

When someone wants to “validate RDF”, what could this mean? They may want to “check that it’s OK” against the rules of RDF/XML itself, or they may want to make sure — in some hard to articulate way — that it doesn’t violate anything said by the RDF schemas (aka “vocabularies”, “ontologies”, etc.) used in the document. Either way, they want some kind of sanity check, ie. “did I screw up?”.

Checking against the RDF/XML grammar is what I’m calling “RDF syntax checking”, and what W3C’s RDF Validation Service is all about. RDF syntax validation only makes sure that your XML is the right shape for sending RDF graphs around, eg. that you’ve used rdf:resource and rdf:about attributes in the right places, that the element nesting patterns are correct, etc. We might call this “RDF/XML-wellformed” by analogy with XML’s own notion of well-formedness. For each concrete RDF syntax, eg. RDF/XML, N3, N-Triples, XHTML2’s RDF/A etc., there would be a different version of syntax checking, to make sure that the document maps into RDF’s abstract graph structures.

The other concept of validation draws us into thinking about the nature of the Semantic Web, and about the differences between RDF and XML.

RDF documents (and their schemas) are about making simple claims about the world, structured in terms of classes (ie. categories) and relationships/properties. XML schemas (W3C XML Schema, DTDs, and others) do something similar, but they do it indirectly, by making statements about XML element structures for describing things in the world. The difference is subtle.

When an RDF schema has markup “defining” something like eg:ShippingAddress, it is talking about a class of thing-in-the-world. The RDF schema (or OWL ontology; OWL extends RDFS) expresses some generalisations about those things in the world that are shipping addresses.

When an XML schema has markup for eg:ShippingAddress it looks at first glance the same, when in fact something utterly different is going on. The XML schema is expressing some generalisations about XML document structures. It is telling us some rules for checking XML documents against the schema, expressed in terms of XML element containment, allowed attribute structures, and relationship to datatypes. It says “if your document has one of these here, and two of those there, and no thingybobs inside a such’n’so, then it is valid as far as I’m concerned”. In other words, it provides clear, machine-checkable rules for testing whether a document (or document-subsection) falls into some useful category. You can therefore validate an XML document against such schemas, to find out if you’ve missed some essential information, or if you’ve over-enthusiastically included more information than the schema-designer was expecting (“Hey, this is a eg:ShippingAddress, what’s that geo:lat, geo:long, photo:Image and foaf:aimChatID doing in there? Invalid!”).

RDF is not like that. You can’t easily do this kind of validation in RDF. RDF schemas don’t care what information you choose to include in some document, nor what other forms of information you mix it with. RDF is pretty mellow about all that, an attitude which can by turns be liberating and infuriating, depending on what you’re trying to do.

In RDF, missing isn’t broken. You can, from the point of view of an RDF schema or OWL ontology, always omit stuff from your RDF/XML documents. They don’t care, because RDF schemas express claims about the world, and not about the XML documents that describe that world. They say things like primaryAuthor is a relationship between a Document and an Agent; they don’t say anything about syntax, nor about how much information anybody ought to provide about documents, agents or their interrelationships. The authors of XML schemas get to say such things; authors of RDF schemas don’t. This is neither good nor bad, just different. It’s a difference grounded in the essential differences between RDF and XML.

Pedantic aside: are XML documents not “in the world” too? Indeed so. We could imagine an RDF/OWL ontology that defined classes called things such as “Element”, “Attribute”, and other core concepts from XML. In fact this work has been done already; see the RDF Schema for the XML Infoset. It hasn’t been updated to use OWL, though, so it doesn’t capture many of the core generalisations about XML document structure. See also an early XML schema proposal, Document Content Description for XML. This was “an RDF vocabulary designed for describing constraints to be applied to the structure and content of XML documents”. Note that those constraints were largely invisible to RDF itself; RDF in DCD was a carrier for information about XML documents and sub-classes of XML document, but generic RDF/OWL tools wouldn’t be able to interpret such descriptions and constraints to classify documents into XML-valid or XML-invalid.

Getting back on topic: RDF schemas make generalisations about the world, XML schemas make generalisations about XML document types. In this way, most XML schema languages can be thought of as a special-purpose “ontology language” optimised for dealing with the domain of XML documents.

How does this relate to validation?

The concept of validation in the XML world is all about checking whether some input is (a) well-formed XML (b) structured according to the XML element/attribute rules of some XML schema. Does it have the right bits of information, in the right places, the right order? Are there any extra bits where there shouldn’t be? And so on.

The concept of validation in the RDF world is necessarily more permissive. First, just as in XML, we check basic syntax. In fact for XML-based RDF notations, like RDF/XML and XHTML2’s RDF/A, we do a lot of the same checking, since the same rules apply. But once we know we have an RDF graph, ie. a representation of a set of subject/predicate/object triples which make simple statements about the world, … what do we do next?

This is where the “liberating and infuriating” duality comes in. Freedom, horrible freedom. Once you’re past the “yup, it’s an RDF graph” stage, it isn’t always clear what kind of machine-checking to do next. The expectations we have from the world of XML schema (as well as from some OO-notions of class hierarchy modelling, and other sources) encourage us to look for a way to “validate” our RDF/XML documents against application schemas. So for example, we might think that we want to check a document against the RDF schema for Dublin Core, or FOAF, or RSS1, or MusicBrainz, Creative Commons, etc.

What could an RDF system do, to check that you’ve not screwed up when writing documents that use these schemas? Basically, all it can do is look for contradictions, ie. where you make statements about the world that simply couldn’t be so. It might, for example, remind you that you’re saying that an xyz:Document has a geo:lat of “52.1”, yet the domain of geo:lat is geo:SpatialThing, and xyz:Document and geo:SpatialThing are marked as owl:disjointWith each other. In other words, that what you’re saying doesn’t fit with the meaning of the terms given in the geo: and xyz: schemas. By ascribing a geo:lat to a document you are implicitly claiming that it is a spatial thing, yet the claims in the schema disagree with this, since they use RDF/OWL to claim that nothing can be both an xyz:Document and a geo:SpatialThing at the same time.
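To make the geo:lat example concrete, here is a toy Python sketch of that one inference step. It is not how an OWL reasoner works; it handles only rdfs:domain and owl:disjointWith, uses prefixed names as plain strings, and the `ex:report` resource and the function name are made up for illustration.

```python
RDF_TYPE = "rdf:type"

def find_disjointness_clashes(data, domains, disjoint_pairs):
    """Return (subject, class_a, class_b) for subjects placed in two disjoint classes."""
    # Collect explicit types, plus types implied by property domains.
    types = {}
    for s, p, o in data:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)
        if p in domains:
            types.setdefault(s, set()).add(domains[p])
    clashes = []
    for s, classes in types.items():
        for a, b in disjoint_pairs:
            if a in classes and b in classes:
                clashes.append((s, a, b))
    return clashes

# Instance data: a document that has been given a latitude.
data = [
    ("ex:report", RDF_TYPE, "xyz:Document"),
    ("ex:report", "geo:lat", "52.1"),
]
# Schema claims: the domain of geo:lat, and one disjointness axiom.
domains = {"geo:lat": "geo:SpatialThing"}
disjoint_pairs = [("xyz:Document", "geo:SpatialThing")]

clashes = find_disjointness_clashes(data, domains, disjoint_pairs)
print(clashes)
```

Remove the disjointness axiom and the same data produces no complaint at all: the checker would simply conclude that the document is also a spatial thing.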

This is a pretty simple example; things become more compelling as schemas get more complex. This sort of checking is useful in schema/ontology design, as much as for checking of instance documents: it is easy for a schema to embody a conceptual confusion. The community around W3C’s OWL ontology language have a lot of scientific know-how in this area – eg. checking huge, complex ontologies for mistakes. Think about the potential for error when reasoning about aircraft parts, or in the lifesciences.

There are some “OWL Validators” out there that can do this sort of checking. For example, see Mindswap’s Online OWL consistency checker, built using their Pellet reasoner. There’s also a similar validator at Manchester. BBN’s OWL validator is more concerned with checking the abstract (RDF-encoded) syntax of OWL, ie. it does not do full inference.

How useful is this sort of logical checking for simpler metadata applications? eg. RSS for data syndication, Dublin Core, FOAF etc. Well, it is a start. But it barely scratches the surface of what could be built on top of RDF. There are so many ways we can screw up our data, and only some of those are manifested as machine-checkable in the above manner. There are other forms of bad data than those that make logical errors.

For example, we could write (as many do), dc:author instead of dc:creator. That term isn’t defined by Dublin Core. Or we could spell a namespace wrong (I had http://xmlns.com/1.0/foaf/ instead of http://xmlns.com/0.1/foaf/ in the FOAF schema itself for a while; I found the mistake last week while using one of the OWL validators above). Checking for those kinds of errors is useful, and some RDF tools (eg. Cwm, Jena Eyeball) are increasingly offering more “lint”-style facilities. This trend is important, and a huge thing for RDF usability and deployment.
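A “lint”-style check for the dc:author kind of mistake can be sketched as follows. This is a toy in Python, not how Cwm or Eyeball do it; the vocabulary table is a tiny hand-written subset of Dublin Core, and the function name is made up.

```python
DC = "http://purl.org/dc/elements/1.1/"

def find_undefined_terms(triples, vocabularies):
    """Return predicates that sit in a known namespace but are not defined there."""
    undefined = []
    for _s, predicate, _o in triples:
        for namespace, terms in vocabularies.items():
            if predicate.startswith(namespace):
                local_name = predicate[len(namespace):]
                if local_name not in terms and predicate not in undefined:
                    undefined.append(predicate)
    return undefined

# A tiny subset of Dublin Core, for illustration only.
vocabularies = {DC: {"title", "creator", "date"}}

triples = [
    ("http://example.org/doc", DC + "title", "An essay"),
    ("http://example.org/doc", DC + "author", "Dan"),  # not a Dublin Core term
]

unknown = find_undefined_terms(triples, vocabularies)
print(unknown)
```

A misspelt namespace is a different case: the predicate then falls outside every known vocabulary, so catching it needs a separate “namespace nobody has heard of” warning.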

There is, however, another form of “RDF checking” that deserves much more attention and research, and which, I’ll even dare to claim, may prove critical to getting widespread adoption of Semantic Web technology in the public Web.

This is the idea of checking our RDF/XML documents against descriptive patterns that capture application-specific information needs. In the Dublin Core community, these are called “application profiles”. In the XML world, Rick Jelliffe’s excellent Schematron system has led the way. The idea, roughly, is that real-world applications often have information needs that are not expressed in schema definitions, since they are not shared by all users of the schema. This is a natural side effect of the admirable urge to use common schemas (whether XML or RDF) across the globe, as well as an acknowledgement that documents and data aren’t static, but part of a complex lifecycle in which different checks are appropriate in different environments. In the RDF world, such checking is also important, since we don’t have XML’s native concern for “missing” or “unexpected” chunks of data. We just have a graph, that is an unordered set of triples whose meaning is governed largely by schema-dictionary definitions of the property and class names used.

As Edd Dumbill put it,

Processing RDF is therefore a matter of poking around in this graph. Once a program has read in some RDF, it has a ball of spaghetti on its hands. You may like to think of RDF in the same way as a hashtable data structure — you can stick whatever you want in there, in whatever order you want.

Such liberating flexibility! Semi-structured chaos! How on earth can programmers be expected to deal with it? What OO coder would replace their nicely organised structures with a hashtable?

I’m always amused when RDF and the Semantic Web are misrepresented as an exercise in formalistic centralisation, as promoting an ivory tower “one perfect ontology for the planet” and so on. True, we have our formalists, and I for one am eternally grateful to them for the huge amount of work the formal KR guys put into cleaning up W3C’s RDF specs. But RDF, if conceived of as a naive attempt to create a machine-readable theory of everything, is tragically misunderstood. RDF is a strategy for principled decentralisation in a world where unanticipated data re-use, unanticipated data extensions, are valued. It is anything but centralised.

RDF says, to schema authors: “don’t model how you think your data should be written in XML, model its assumptions about the world”. It says, “don’t tell me what can go inside a workplaceHomepage tag; tell me what kinds of things are related by the workplaceHomepage property”. And to the authors of RDF/XML documents it says, “I won’t pretend to know your information needs better than you; put any RDF statements into the graph that you think are meaningful, useful, affordable and shareable”.

For this to work, it takes away some choices from schema authors. There are all kinds of assumptions that can accompany XML schemas, which RDF takes away from the authors of RDF schemas. Most obviously, it imposes a particular syntax. In RDF/XML (one of many concrete RDF notations) we have all those rdf:about and rdf:resource attributes, alongside that sometimes-verbose “striped” notation. So schema authors, as they move from XML to RDF, give up their right to decide what XML tag structures to impose on their own users. They let RDF do that, and trust the RDF community to keep coming up with new and better notations that can be shared across all such schemas. Currently we have N3, the evolving XHTML2 RDF/A work and GRDDL as evidence that the RDF community take this responsibility seriously.

What else do schema authors give up? They give up their right to make rules couched in terms of data being missing versus present; defaults, for example. RDF doesn’t impose those kinds of restriction on the creators of RDF documents.

For example, looking at SMBmeta from Dan Bricklin (great name!). The smbmeta spec tells us:

The “country” is assumed to be “us” if omitted

For better or worse, you simply couldn’t say that in an RDF schema. To attempt to do so would be to misunderstand RDF entirely. RDF applications, given some other background knowledge, might in certain circumstances be justified in making that conclusion. But you can’t say “anybody who uses my schema to describe a Business, but omits the country code, is implicitly saying that the business is in the USA”.

Why? RDF is designed for open-world, data sharing apps, where content is syndicated, re-syndicated, merged, separated, shredded and glued back together. A single document might draw on a dozen different vocabularies, mixed tightly together at the elements and attributes level. It would be completely impractical for applications to have to read and understand the corresponding schemas, their idiosyncratic defaulting rules, and the interactions between those rules across schemas.

To make something like SMBmeta markup play nicely in the RDF world, we need to do more than simply analyse its XML structures and come up with a corresponding structure of RDF class and property names. We need to look deeper at the assumptions in the data, so that we don’t trample over the intended meaning of the XML when bringing it into the RDF environment. I took a look at making an SMBmeta file into RDF yesterday. More on that another time. The file got a bit uglier — the so-called RDF tax — to which it is tempting to say, “oh, we’ll just use a different RDF syntax”. But there’s a deeper issue, intimately bound up with the idea of document validity and data checking discussed here.

By moving a format to use RDF, we restrict the kinds of things a schema author can demand of his or her user. In RDF, you don’t get to say “if country is missing, assume USA”. This is why I consider GRDDL (the use of XSLT to go from non-RDF XML into RDF/XML) to be only part of the solution to the “Cambridge Communiqué” problem of better connecting RDF to XML. So whenever we try to RDFize some XML format, we have to look very carefully for these kinds of assumptions, since they don’t play well in the RDF data-mixing world. The Atom syndication format is another example I’ve been looking at. It turns out that we can construct a document that meets the syntactic rules of both Atom and RDF/XML, ie. there is a profile of Atom that could be used directly as RDF (see syntax check, which will gripe about namespace prefixes). But before we rush around celebrating, we need to check the Atom spec very carefully, to see if the meaning of Atom terms makes sense in the RDF world (eg. defaulting rules: would RDF applications miss out on extra data? that’s bearable; would they misunderstand Atom data? that’s not).

What other constraints do we take away from XML schema authors? Here’s a big one: attaching meaning to XML element order. RDF graphs, like SQL relational tables, are unordered. This is because they correspond to logical assertions about the world. The importance of this, when trying to understand RDF, simply can’t be overestimated.

A couple of quick examples: in RSS 1.0, we wanted the RDF content to include an ordered list of the items described in the feed. Because RDF doesn’t preserve XML element ordering after you’ve parsed the document, we had to find a way of putting that ordering information into the graph. The design we chose was to have an rdf:Seq structure at the top of the document. More recently, I have this week been looking into the possibilities for reflecting GML into RDF. GML syntax looks a lot like RDF already, and I was able to make an example into parseable RDF very easily. However, GML often needs to describe ordered lists of points (eg. for polygons on a map). When I looked more closely at the RDFization design, I realised I had created meaningless data, since the RDF statements didn’t preserve the XML element ordering information that GML uses to link a set of points into a line. The following markup looks like XML/RDF, but doesn’t preserve enough information. There are other designs (eg. using RDF collections, or packing data into microsyntax-formatted strings) that need to be explored.


<Railway>
  <gml:name>Track 29</gml:name>
  <centerLine>
    <gml:LineString>
      <gml:pos>100 100</gml:pos>
      <gml:pos>200 200</gml:pos>
      <gml:pos>300 300</gml:pos>
    </gml:LineString>
  </centerLine>
</Railway>
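The loss of ordering can be shown with plain Python sets standing in for RDF graphs. The sketch below first maps the three points naively, then borrows the rdf:Seq idea mentioned for RSS 1.0, where membership properties (rdf:_1, rdf:_2, ...) carry the position. It is an illustration of the problem, not a proposal for how GML should be RDFized; the node names `_:line` and `_:seq` are arbitrary.

```python
# Positions in document order, as someone reading the XML would see them.
positions_in_document = ["100 100", "200 200", "300 300"]

# Naive mapping: one gml:pos triple per point. A graph is a set of triples,
# so the order in which they were written is not part of the data.
naive = {("_:line", "gml:pos", pos) for pos in positions_in_document}
naive_reversed = {("_:line", "gml:pos", pos) for pos in reversed(positions_in_document)}
print(naive == naive_reversed)

def to_seq(seq_node, items):
    """Encode an ordered list with rdf:Seq-style membership properties."""
    return {(seq_node, "rdf:_%d" % i, item) for i, item in enumerate(items, start=1)}

def from_seq(triples, seq_node):
    """Recover the ordered list from the membership properties."""
    members = [
        (int(p[len("rdf:_"):]), o)
        for s, p, o in triples
        if s == seq_node and p.startswith("rdf:_")
    ]
    return [o for _index, o in sorted(members)]

seq_graph = to_seq("_:seq", positions_in_document)
print(from_seq(seq_graph, "_:seq"))
```

With the naive mapping, a railway line and the same line drawn backwards (or zig-zagged) are the same graph. Once the position is stated in the triples themselves, the order survives any amount of merging and re-serialising.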

The exercise of “RDFizing” an XML format consists of more than mapping from a tree-based data structure into an edge-labelled graph one. We also have to go through and find out how the original format deals with things like missing data, defaulting rules, and the use of XML element order to carry meaning. Once in the RDF world, all vocabularies behave the same; we take some flexibility and choice away from schema authors, so that end-user applications can enjoy a different kind of flexibility and choice.

It is important to RDF that element order information can (usually) be thrown away. This design allows us to grab data from multiple sources, and integrate it relatively cheaply. But it is also grounded in the workings of logic and human communication. Consider that we can say in English something like:

There’s a person whose name is “Dan Brickley”, whose birthday is January 9th, and who works for the organisation which has a homepage identified by the URI <http://www.w3.org/> and which has a title “World Wide Web Consortium”.

Instead of that, we could have said something a bit different, but which would be true or false of the world under exactly the same circumstances:

A thing with a title of “World Wide Web Consortium”, identified by the URI <http://www.w3.org/> is a homepage of an organisation that a person born on January 9th called “Dan Brickley” works for.

Machines are really pretty bad at natural language parsing, because they lack common sense. They can’t understand the vast hidden context behind our utterances. RDF won’t fix that. Ever. It is a design for data sharing in a world where machines have these terrible limitations.

Sometimes it makes sense to adjust our human practices to make things easier on computers, when we want the computers to do things we’d rather not do ourselves. So for example, we simplify our handwriting, so our PDAs can recognise the letters in our handwriting. RDF is a simplification of the practice of making basic claims about the world. The above two English-prose paragraphs are true in exactly the same circumstances, and false in exactly the same circumstances, regardless of the re-ordering of the claims.

W3C went to great lengths to make sure that RDF shares these logical characteristics, since it is fundamental to data-sharing on a global scale (which is what W3C is all about). Let’s see the RDF version written in XML:


<Person xmlns="http://xmlns.com/foaf/0.1/"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
  <name>Dan Brickley</name>
  <birthday>01-09</birthday>
  <workplaceHomepage>
    <Document rdf:about="http://www.w3.org/">
      <dc:title>World Wide Web Consortium</dc:title>
    </Document>
  </workplaceHomepage>
</Person>

Here's another way we might have written those exact same 6 statements:


<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/dc/elements/1.1/"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Document rdf:about="http://www.w3.org/">
    <title>World Wide Web Consortium</title>
  </foaf:Document>
  <foaf:Person>
    <foaf:workplaceHomepage rdf:resource="http://www.w3.org/"/>
    <foaf:birthday>01-09</foaf:birthday>
    <foaf:name>Dan Brickley</foaf:name>
  </foaf:Person>
</rdf:RDF>

In both cases, RDF sees only the underlying 3-part statements. Six of them:

  • _:x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
  • _:x <http://xmlns.com/foaf/0.1/name> "Dan Brickley" .
  • _:x <http://xmlns.com/foaf/0.1/birthday> "01-09" .
  • <http://www.w3.org/> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document> .
  • <http://www.w3.org/> <http://purl.org/dc/elements/1.1/title> "World Wide Web Consortium" .
  • _:x <http://xmlns.com/foaf/0.1/workplaceHomepage> <http://www.w3.org/> .

The RDF Semantics spec explains the maths behind this far more carefully and cleverly than I could ever manage. But the idea is simple. Just as the two paragraphs of English prose are equivalent, regardless of statement ordering, so are the bits of RDF.

These RDF statements have no more intrinsic, meaningful order than the rows in a relational database, or the files in a directory on your computer. They might, of course, be sorted on various criteria. But the schemas used (in this case, Dublin Core and FOAF), like all RDF schemas, do not impose application-specific meaning on the ordering of the XML elements. To risk over-emphasising a point, neither do the creators of Dublin Core, or of FOAF, get to make up rules for what happens if data is missing from the graph. Data costs time and money to manage, and there are thousands of reasons why data might usefully be missing from an RDF graph. RDF pushes such concerns down into the application layer: if you have an application which wants the birthdays of all employees to be listed, that’s your own business. It is a separate problem from that of defining a shared markup for birthdays, or for representing employment.

So let’s look again at that graph of 6 statements, but as a diagram, since that emphasises the unordered nature of the data (follow link for full-size image):

[Image: 6 RDF statements in a graph]

As you can see, the RDF abstraction normalises both of the XML/RDF forms given above into a single structure. This is another of those “love it or hate it” features of RDF. Developers sometimes complain that RDF has too many ways of writing the same thing. Flipped around, this is a feature: RDF provides an account of what seemingly diverse descriptions have in common.

Consider an application that was trying to collect the birthdays of W3C team members, perhaps to output in iCalendar format, or to link to our Amazon wishlists and suchlike. The application doesn’t need to care about lots of things. It doesn’t care if it loads data from RDF/XML, or N3, or XHTML2 RDF/A, or from a GRDDL-based XSLT transformation. All it cares about is scooping up some statements about people, their “workplace homepage”, and their names and birthdays. In this simple example, all of that information can be expressed using a single RDF vocabulary, FOAF. But it is essential to realise that the RDF application’s information needs could very easily have drawn upon other RDF vocabularies (eg. it might be looking for birthdays of people who work for W3C Member organizations, and who have contributed to W3C standards). FOAF isn’t enough to express that, but mixed in with other schemas, much of that information is available in the Semantic Web already.

Forget about validation, RDF checking, and constraints for a while. Think about querying… about how an RDF application would express a request for some chunks of data (whether using one schema, or several). W3C has a new RDF query language for this, called SPARQL. Here’s how we would use SPARQL to ask for W3C team birthdays. If using the 6-statement example, we won’t get many answers. The query would be:


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name ?bday
WHERE {
  ?p rdf:type foaf:Person .
  ?p foaf:name ?name .
  ?p foaf:birthday ?bday .
  ?p foaf:workplaceHomepage <http://www.w3.org/> .
}
ORDER BY ?name

The query describes a pattern to be matched against some RDF data, and is answered with a table of variable-to-value bindings, pretty much like SQL. The query looks a lot like an RDF graph, in fact, with “?” marking variables, ie. nodes in the graph where we don’t know the specific content of the node. Our application query here says “find me values for ?name and ?bday for anything (we’ll call it “p”) that has a type foaf:Person and a workplaceHomepage matching W3C’s homepage”.
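The “pattern matched against a graph” idea fits in a few lines of Python. The sketch below is a naive matcher for a list of triple patterns, with strings beginning “?” treated as variables. It is not a SPARQL engine (no DISTINCT, ORDER BY, OPTIONAL or named graphs, and no distinction between URIs and literals), and the function names are made up; the graph is the six-statement example, with `_:x` standing for the person.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/elements/1.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def is_variable(term):
    return term.startswith("?")

def match_pattern(pattern, triple, bindings):
    """Try to extend bindings so that pattern matches triple; None on failure."""
    new_bindings = dict(bindings)
    for pattern_term, value in zip(pattern, triple):
        if is_variable(pattern_term):
            # Bind the variable, or check it agrees with an earlier binding.
            if new_bindings.setdefault(pattern_term, value) != value:
                return None
        elif pattern_term != value:
            return None
    return new_bindings

def query(graph, patterns):
    """Return every set of variable bindings that satisfies all the patterns."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in graph
            for extended in [match_pattern(pattern, triple, bindings)]
            if extended is not None
        ]
    return solutions

graph = {
    ("_:x", RDF_TYPE, FOAF + "Person"),
    ("_:x", FOAF + "name", "Dan Brickley"),
    ("_:x", FOAF + "birthday", "01-09"),
    ("_:x", FOAF + "workplaceHomepage", "http://www.w3.org/"),
    ("http://www.w3.org/", RDF_TYPE, FOAF + "Document"),
    ("http://www.w3.org/", DC + "title", "World Wide Web Consortium"),
}
patterns = [
    ("?p", RDF_TYPE, FOAF + "Person"),
    ("?p", FOAF + "name", "?name"),
    ("?p", FOAF + "birthday", "?bday"),
    ("?p", FOAF + "workplaceHomepage", "http://www.w3.org/"),
]

rows = sorted((s["?name"], s["?bday"]) for s in query(graph, patterns))
print(rows)
```

Note that the patterns are themselves just triples with holes in them, which is the point: a query has the same shape as the data it is looking for.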

I don’t intend to give a SPARQL tutorial here. The official spec is the best place to start learning (and please do send review comments!); there are also some (slightly out of date) tutorial materials online. You might also look at the RDFAuthor tutorial, since RDFAuthor serves as an authoring tool both for RDF itself, and for RDF queries (using Squish, one of the languages that fed into the SPARQL design). That tutorial makes one point quite clearly: RDF queries and RDF graphs are pretty similar structures. If you delve into the details of SPARQL, you’ll find some points at which it departs from the simple-minded world of RDF triples. It lets you ask queries, for example, where you say things about the source of the graph, as well as queries in which certain properties are marked as optional (eg. we might ask for photographic depictions of the people in the query, yet not want the query to fail if those bits of data weren’t available in the graph). Both of these features are useful and were expected when the SPARQL work began, but they move us away from the simple narrative of “RDF queries are just like RDF graphs with bits of graph marked as missing”. As a quick intro to RDF querying with SPARQL, that concept is worth hanging onto. And it also helps us think about the relationship between RDF query and RDF “validation” or data checking.

I mentioned the Schematron system earlier. Schematron is useful when thinking about how new kinds of RDF checking and validation might work. It is built on top of the XPath spec. By testing to see whether specific XPath addresses match against some target document, we can probe the contents to see whether they meet our application needs or not. The tagline on the Schematron site says it all: Schematron is a language for making assertions about patterns found in XML documents. A few words from the overview page, which describes how you can develop and mix two kinds of schemas:

  • Report elements allow you to diagnose which variant of a language you are dealing with.
  • Assert elements allow you to confirm that the document conforms to a particular schema.

The approach is pretty simple:

  • First, find context nodes in the document (typically elements) based on XPath path criteria;
  • Then, check to see if some other XPath expressions are true, for each of those nodes.

That’s it, really. Schematron schemas allow you to check XML documents against rules that go beyond those that come with the elements and attributes used by the document. You might, for example, be combining various namespaces, and want to have them combined in a certain markup pattern. Or you might want to apply different data-integrity checks at different points in a workflow.
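The two-step approach (find context nodes, then test an expression against each) can be imitated with Python's standard library. This is a sketch of the idea only, not Schematron: `xml.etree.ElementTree` supports just a small subset of XPath, and the `orders` document, the rule and the function name are invented for the example.

```python
import xml.etree.ElementTree as ET

def check(document, rules):
    """Apply (context_path, test_path, message) rules; return failed messages."""
    root = ET.fromstring(document)
    failures = []
    for context_path, test_path, message in rules:
        # Step 1: find the context nodes.
        for node in root.findall(context_path):
            # Step 2: assert that the test path matches something under each one.
            if node.find(test_path) is None:
                failures.append(message)
    return failures

document = """
<orders>
  <order id="1"><address><country>uk</country></address></order>
  <order id="2"><address/></order>
</orders>
"""

# An application-specific rule that no schema for <address> need impose.
rules = [(".//order", "address/country", "an order's address should give a country")]

failures = check(document, rules)
print(failures)
```

Only the second order trips the rule. A different application, or a different stage of a workflow, could run a different rule list over the very same document.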

This idea is powerful, and simple. I’ve long wanted the same approach, but defined over RDF graphs. Some time ago, Libby Miller prototyped it as Schemarama, using the Squish RDF query language. XPath doesn’t make much sense for a Schematron-for-RDF, since we want something that can be used against RDF graphs, rather than XML documents. Now that the SPARQL language is more or less finished, I think it’s time to revisit the approach, but using SPARQL instead. SPARQL’s OPTIONAL mechanism, I think, makes it much more practical. There are also possibilities for using OWL’s data structures in a similar way; see Damian Steer’s Using OWL for Forms, Validation, and Application Profiles for RDF XTech paper. As he says,

One of the most common issues encountered by RDF developers is the need for some form of constraint on their data, and particularly validation. Unfortunately the RDF schema languages are (for perfectly good reasons) unsuited for this purpose. For example, property ranges are commonly misunderstood by newcomers to RDF as restricting possible values.

I’m not yet sure which path will be most fruitful. OWL isn’t really meant to express such application-oriented constraints, but the data structures look temptingly useful. I do lean towards layering a next-generation RDF checker on top of SPARQL facilities, for one reason. Queries capture application usage practice. They tell us what the application wants. The Semantic Web requires something of a cultural shift here, away from the idea that “validation” is solely the act of comparing some instance data against the schemas it uses. Those schemas (even if they use OWL ontology extensions) simply don’t provide information to do enough useful checking. Instead of this, we need to explore ways of sharing machine-readable characterisations of common RDF graph structures. In Schematron’s terms, we would build data-checking applications on top of a language for making assertions about patterns found in RDF graphs. This could all be layered on top of SPARQL, or on OWL, or on some new rule language. I don’t care how it is done, so long as it happens! When RDF is projected back into an XML environment, Schematron itself could even be used.

The absence of this kind of data-checking is hurting RDF deployment, because it discourages RDF vocabulary designers from re-using other schemas in the instance data that their applications produce and consume.

Revisiting the example application considered above, imagine we have built some applications based around the idea of finding people’s birthday, name, and photo. We might use FOAF to describe the people info, Dublin Core to describe the image, Creative Commons to describe the usage rights for the image, and so forth. While W3C has a language for describing the meaning of the individual terms (ie. classes and properties) we’d use in our descriptions, it doesn’t yet have a standard way of capturing the “descriptive recipe” deployed in the application. We don’t have a non-prose (and hence language-neutral) way of saying how those pieces of RDF vocabulary are combined into a larger, and more useful, data structure. We don’t have a way of checking instance data against it, so that a “birthday photo app” validator could probe the RDF graph and say, “that file is valid for this purpose”.

The cultural shift we need, and the toolset to accompany it, is a shift to application-oriented validation. Instead of an absolute, universal “yes” or “no” for some RDF document, we need a more nuanced approach. An RDF document might be well-suited for use in a photo metadata application, but missing some data that is needed for an addressbook. The existence of a common framework for expressing such information needs would go a long way to addressing the “rummaging in spaghetti” feeling that developers have when they work with RDF, as well as the desire that application authors have for expressing XML-esque constraints (eg. the FOAFnet subset profile of FOAF).
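As a rough illustration of what “valid for this purpose” could mean, the Python sketch below checks one description against two hypothetical application profiles, each reduced to a set of properties that the application needs. Real profiles would be richer (the essay suggests SPARQL queries or OWL structures as carriers); the profile names and contents here are invented, though the FOAF property names are real.

```python
# Hypothetical profiles: the properties each application needs about a person.
PROFILES = {
    "birthday photo app": {"foaf:name", "foaf:birthday", "foaf:depiction"},
    "addressbook": {"foaf:name", "foaf:mbox"},
}

def check_against_profiles(graph, subject, profiles):
    """Map each profile name to the sorted list of properties missing for subject."""
    present = {p for s, p, _o in graph if s == subject}
    return {name: sorted(required - present) for name, required in profiles.items()}

graph = {
    ("_:x", "foaf:name", "Dan Brickley"),
    ("_:x", "foaf:birthday", "01-09"),
    ("_:x", "foaf:depiction", "http://example.org/dan.jpg"),
}

report = check_against_profiles(graph, "_:x", PROFILES)
for name, missing in report.items():
    print(name, "->", "ok" if not missing else "missing " + ", ".join(missing))
```

The same graph is fine for one application and incomplete for the other, and neither verdict says anything about whether the RDF itself is “valid”. Nothing in FOAF's schema needed to change to express either requirement.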

The approach could be as simple as having online catalogues of “common queries”, so that datasources and applications could share a way of talking about the patterns of RDF that they produce and consume. Or it could be a lot fancier; there’s a lot to explore, and a lot to gain. SPARQL is probably the place to start exploring…

7 Responses to CheckRDFSyntax and Schemarama Revisited

  1. Jimmy says:

    Although the browser loads it as HTML or XHTML, try this….

    Use that bookmark on My SoC Project’s web site at http://swapi.ourproject.org/ Apache MIME type hacking is fun!
