I have made some assumptions about the options passed to the validator; these are easily adjusted by looking at the source of the normal submission page. It currently asks for ‘N-Triples’ syntax back (this is more compact than the default tabular view), and also has the ‘rdf:RDF is omitted’ flag set, which is useful for checking documents that indicate their RDF-ness in some other way (eg. DOAP files with a content-type of ‘application/rdf+xml’).
Some more Favelets for the W3C Markup Validation Service are also available. If people find this one useful, I’ll see about getting it linked from that page and from the RDF validator itself.
Why call it “Check RDF syntax” instead of “validate RDF”, you might ask?
There’s a long answer and a short answer. I intended to keep this short, but failed. Basically, the word “validate” is terribly overloaded, particularly when we compare what it means in XML with the world of RDF. RDF users all too frequently ask for the ability to “validate” a document against an RDF schema, when what they often want is one of several things. Explaining the difference is not the easiest of things to do…
When someone wants to “validate RDF”, what could this mean? They may want to “check that it’s OK” against the rules of RDF/XML itself, or they may want to make sure — in some hard to articulate way — that it doesn’t violate anything said by the RDF schemas (aka “vocabularies”, “ontologies”, etc.) used in the document. Either way, they want some kind of sanity check, ie. “did I screw up?”.
Checking against the RDF/XML grammar is what I’m calling “RDF syntax checking”, and what W3C’s RDF Validation Service is all about. RDF syntax validation only makes sure that your XML is the right shape for sending RDF graphs around, eg. that you’ve used rdf:about attributes in the right places, that the element nesting patterns are correct, etc. We might call this “RDF/XML-wellformed” by analogy with XML’s own notion of well-formedness. For each concrete RDF syntax, eg. RDF/XML, N3, N-Triples, XHTML2’s RDF/A etc., there would be a different version of syntax checking, to make sure that the document maps into RDF’s abstract graph structures.
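To make “syntax checking” concrete, here’s a toy sketch in Python of what such a checker does, for a tiny made-up subset of N-Triples: either every line maps cleanly into a subject/predicate/object triple, or the document is rejected. (The real grammar, and the W3C service, are much richer; everything here is illustrative.)

```python
import re

# A toy sketch of "RDF syntax checking" for a tiny, made-up subset of
# N-Triples: either every line maps cleanly into a (subject, predicate,
# object) triple, or the document is rejected. The real grammar is richer.
TRIPLE = re.compile(
    r'^(<[^>]+>|_:\w+)\s+'             # subject: URI ref or blank node
    r'(<[^>]+>)\s+'                    # predicate: URI ref
    r'(<[^>]+>|_:\w+|"[^"]*")\s*\.$'   # object: URI ref, blank node or literal
)

def parse_ntriples(text):
    triples = []
    for n, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        m = TRIPLE.match(line)
        if not m:
            raise SyntaxError(f"line {n}: not a well-formed triple: {line!r}")
        triples.append(m.groups())
    return triples

doc = '''
<http://www.w3.org/> <http://purl.org/dc/elements/1.1/title> "World Wide Web Consortium" .
_:x <http://xmlns.com/foaf/0.1/workplaceHomepage> <http://www.w3.org/> .
'''
print(len(parse_ntriples(doc)))  # 2: the document maps into an RDF graph
```

The point is only that syntax checking is a yes/no question about the document’s shape, quite separate from anything the schemas might say.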
The other concept of validation draws us into thinking about the nature of the Semantic Web, and about the differences between RDF and XML.
RDF documents (and their schemas) are about making simple claims about the world, structured in terms of classes (ie. categories) and relationships/properties. XML schemas (W3C XML Schema, DTDs, and others) do something similar, but they do it indirectly, by making statements about XML element structures for describing things in the world. The difference is subtle.
When an RDF schema has markup “defining” something like eg:ShippingAddress, it is talking about a class of thing-in-the-world. The RDF schema (or OWL ontology; OWL extends RDFS) expresses some generalisations about those things in the world that are shipping addresses.
When an XML schema has markup for eg:ShippingAddress it looks at first glance the same, when in fact something utterly different is going on. The XML schema is expressing some generalisations about XML document structures. It is telling us some rules for checking XML documents against the schema, expressed in terms of XML element containment, allowed attribute structures, and relationship to datatypes. It says “if your document has one of these here, and two of those there, and no thingybobs inside a such’n’so, then it is valid as far as I’m concerned”. In other words, it provides clear, machine-checkable rules for testing whether a document (or document-subsection) falls into some useful category. You can therefore validate an XML document against such schemas, to find out if you’ve missed some essential information, or if you’ve over-enthusiastically included more information than the schema-designer was expecting (“Hey, this is an eg:ShippingAddress, what’s that geo:lat, geo:long, photo:Image and foaf:aimChatID doing in there? Invalid!”).
RDF is not like that. You can’t easily do this kind of validation in RDF. RDF schemas don’t care what information you choose to include in some document, nor what other forms of information you mix it with. RDF is pretty mellow about all that, an attitude which can by turns be liberating and infuriating, depending on what you’re trying to do.
In RDF, missing isn’t broken. You can, from the point of view of an RDF schema or OWL ontology, always omit stuff from your RDF/XML documents. They don’t care, because RDF schemas express claims about the world, and not about the XML documents that describe that world. They say things like “primaryAuthor is a relationship between a Document and an Agent”; they don’t say anything about syntax, nor about how much information anybody ought to provide about documents, agents or their interrelationships. The authors of XML schemas get to say such things; authors of RDF schemas don’t. This is neither good nor bad, just different. It’s a difference grounded in the essential differences between RDF and XML.
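As a sketch of that difference: here, in Python, is roughly what an RDF schema’s domain and range claims let a machine do. They license inferences about the things described; nothing gets rejected for including or omitting data. (The eg: names are invented for illustration.)

```python
# Sketch: an RDF schema's domain/range claims license inferences about
# things in the world; they never reject a document for saying too much
# or too little. The eg: names are invented for illustration.
schema = {
    ("eg:primaryAuthor", "rdfs:domain", "eg:Document"),
    ("eg:primaryAuthor", "rdfs:range", "eg:Agent"),
}
data = {("eg:report42", "eg:primaryAuthor", "eg:danbri")}

def infer_types(data, schema):
    inferred = set()
    for s, p, o in data:
        for prop, rel, cls in schema:
            if p == prop and rel == "rdfs:domain":
                inferred.add((s, "rdf:type", cls))  # subject must be a Document
            if p == prop and rel == "rdfs:range":
                inferred.add((o, "rdf:type", cls))  # object must be an Agent
    return inferred

print(sorted(infer_types(data, schema)))
# New knowledge about the world is inferred; nothing is declared "invalid".
```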
Pedantic aside: are XML documents not “in the world” too? Indeed so. We could imagine an RDF/OWL ontology that defined classes such as “Element”, “Attribute”, and other core concepts from XML. In fact this work has been done already; see the RDF Schema for the XML Infoset. It hasn’t been updated to use OWL, though, so it doesn’t capture many of the core generalisations about XML document structure. See also an early XML schema proposal, Document Content Description for XML. This was “an RDF vocabulary designed for describing constraints to be applied to the structure and content of XML documents”. Note that those constraints were largely invisible to RDF itself; RDF in DCD was a carrier for information about XML documents and sub-classes of XML document, but generic RDF/OWL tools wouldn’t be able to interpret such descriptions and constraints to classify documents as XML-valid or XML-invalid.
Getting back on topic: RDF schemas make generalisations about the world, XML schemas make generalisations about XML document types. In this way, most XML schema languages can be thought of as a special-purpose “ontology language” optimised for dealing with the domain of XML documents.
How does this relate to validation?
The concept of validation in the XML world is all about checking whether some input is (a) well-formed XML (b) structured according to the XML element/attribute rules of some XML schema. Does it have the right bits of information, in the right places, the right order? Are there any extra bits where there shouldn’t be? And so on.
The concept of validation in the RDF world is necessarily more permissive. First, just as in XML, we check basic syntax. In fact for XML-based RDF notations, like RDF/XML and XHTML2’s RDF/A, we do a lot of the same checking, since the same rules apply. But once we know we have an RDF graph, ie. a representation of a set of subject/predicate/object triples which make simple statements about the world, … what do we do next?
This is where the “liberating and infuriating” duality comes in. Freedom, horrible freedom. Once you’re past the “yup, it’s an RDF graph” stage, it isn’t always clear what kind of machine-checking to do next. The expectations we have from the world of XML schema (as well as from some OO-notions of class hierarchy modelling, and other sources) encourage us to look for a way to “validate” our RDF/XML documents against application schemas. So for example, we might think that we want to check a document against the RDF schema for Dublin Core, or FOAF, or RSS1, or MusicBrainz, Creative Commons, etc.
What could an RDF system do, to check that you’ve not screwed up when writing documents that use these schemas? Basically, all it can do is look for contradictions, ie. where you make statements about the world that simply couldn’t be so. It might, for example, remind you that you’re saying that a xyz:Document has a geo:lat of “52.1”, yet the domain of geo:lat is geo:SpatialThing, and xyz:Document and geo:SpatialThing are marked as owl:disjointWith each other. In other words, that what you’re saying doesn’t fit with the meaning of the terms given in the xyz: schemas. By ascribing a geo:lat to a document you are implicitly claiming that it is a spatial thing, yet the claims in the schema disagree with this, since they use RDF/OWL to claim that nothing can be both an xyz:Document and a geo:SpatialThing at the same time.
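That contradiction-hunting can be sketched in a few lines of Python (the abbreviated URIs are the made-up ones from the example, and real OWL reasoners are vastly more capable):

```python
# Sketch of contradiction-hunting: ascribing geo:lat to something implicitly
# types it as a geo:SpatialThing; the schema says that class is disjoint
# with xyz:Document. URIs abbreviated as in the example above.
schema_domain = {"geo:lat": "geo:SpatialThing"}
disjoint = {("xyz:Document", "geo:SpatialThing")}
data = [
    ("ex:doc1", "rdf:type", "xyz:Document"),
    ("ex:doc1", "geo:lat", "52.1"),
]

def find_contradictions(data, schema_domain, disjoint):
    types = {}  # node -> classes it is asserted or inferred to belong to
    for s, p, o in data:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
        elif p in schema_domain:
            types.setdefault(s, set()).add(schema_domain[p])
    return [(node, a, b)
            for node, classes in types.items()
            for a, b in disjoint
            if a in classes and b in classes]

print(find_contradictions(data, schema_domain, disjoint))
# [('ex:doc1', 'xyz:Document', 'geo:SpatialThing')]: couldn't be so!
```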
This is a pretty simple example; things become more compelling as schemas get more complex. This sort of checking is useful in schema/ontology design, as much as for checking of instance documents: it is easy for a schema to embody a conceptual confusion. The community around W3C’s OWL ontology language have a lot of scientific know-how in this area – eg. checking huge, complex ontologies for mistakes. Think about the potential for error when reasoning about aircraft parts, or in the life sciences.
There are some “OWL Validators” out there that can do this sort of checking. See, for example, Mindswap’s Online OWL consistency checker, built using their Pellet reasoner. There’s also a similar validator at Manchester.
BBN’s OWL validator is more concerned with checking the abstract (RDF-encoded) syntax of OWL, ie. it does not do full inference.
How useful is this sort of logical checking for simpler metadata applications (eg. RSS for data syndication, Dublin Core, FOAF etc.)? Well, it is a start. But it barely scratches the surface of what could be built on top of RDF. There are so many ways we can screw up our data, and only some of those are manifested as machine-checkable in the above manner. There are other forms of bad data than those that make logical errors.
For example, we could write (as many do), dc:author instead of dc:creator. That term isn’t defined by Dublin Core. Or we could spell a namespace wrong (I had http://xmlns.com/1.0/foaf/ instead of http://xmlns.com/0.1/foaf/ in the FOAF schema itself for a while; I found the mistake last week while using one of the OWL validators above). Checking for those kinds of errors is useful, and some RDF tools (eg. Cwm, Jena Eyeball) are increasingly offering more “lint”-style facilities. This trend is important, and a huge thing for RDF usability and deployment.
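A minimal sketch of that kind of “lint” check, assuming a hand-listed subset of Dublin Core terms (real tools like Cwm and Eyeball do far more than this):

```python
# A lint-style check in the spirit of those tools: flag property URIs whose
# namespace we recognise but whose term the schema doesn't define, catching
# slips like dc:author for dc:creator. The term list is abbreviated here.
DC = "http://purl.org/dc/elements/1.1/"
known_terms = {DC: {"creator", "title", "date", "subject", "description"}}

def lint(triples):
    warnings = []
    for s, p, o in triples:
        for ns, terms in known_terms.items():
            if p.startswith(ns) and p[len(ns):] not in terms:
                warnings.append("unknown term: " + p)
    return warnings

print(lint([("ex:doc", DC + "author", "Dan")]))   # dc:author isn't defined
print(lint([("ex:doc", DC + "creator", "Dan")]))  # dc:creator is fine: []
```

The same trick (comparing URIs in the data against URIs actually defined in the schema) also catches misspelled namespaces like my 1.0/foaf mistake.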
There is, however, another form of “RDF checking” that deserves much more attention and research, and which I’ll even dare to claim, may prove critical to getting widespread adoption of Semantic Web technology in the public Web.
This is the idea of checking our RDF/XML documents against descriptive patterns that capture application-specific information needs. In the Dublin Core community, these are called “application profiles”. In the XML world, Rick Jelliffe’s excellent Schematron system has led the way. The idea, roughly, is that real-world applications often have information needs that are not expressed in schema definitions, since they are not shared by all users of the schema. This is a natural side effect of the admirable urge to use common schemas (whether XML or RDF) across the globe, as well as an acknowledgement that documents and data aren’t static, but part of complex lifecycle in which different checks are appropriate in different environments. In the RDF world, such checking is also important, since we don’t have XML’s native concern for “missing” or “unexpected” chunks of data. We just have a graph, that is an unordered set of triples whose meaning is governed largely by schema-dictionary definitions of the property and class names used.
As Edd Dumbill put it,
Processing RDF is therefore a matter of poking around in this graph. Once a program has read in some RDF, it has a ball of spaghetti on its hands. You may like to think of RDF in the same way as a hashtable data structure — you can stick whatever you want in there, in whatever order you want.
Such liberating flexibility! Semi-structured chaos! How on earth can programmers be expected to deal with it? What OO coder would replace their nicely organised structures with a hashtable?
I’m always amused when RDF and the Semantic Web are misrepresented as an exercise in formalistic centralisation, as promoting an ivory tower “one perfect ontology for the planet” and so on. True, we have our formalists, and I for one am eternally grateful to them for the huge amount of work the formal KR guys put into cleaning up W3C’s RDF specs. But RDF, if conceived of as a naive attempt to create a machine-readable theory of everything, is tragically misunderstood. RDF is a strategy for principled decentralisation in a world where unanticipated data re-use, unanticipated data extensions, are valued. It is anything but centralised.
RDF says, to schema authors: “don’t model how you think your data should be written in XML, model its assumptions about the world”. It says, “don’t tell me what can go inside a workplaceHomepage tag; tell me what kinds of things are related by the workplaceHomepage property”. And to the authors of RDF/XML documents it says, “I won’t pretend to know your information needs better than you; put any RDF statements into the graph that you think are meaningful, useful, affordable and shareable”.
For this to work, it takes away some choices from schema authors. There are all kinds of assumptions that can accompany XML schemas, which RDF takes away from the authors of RDF schemas. Most obviously, it imposes a particular syntax. In RDF/XML (one of many concrete RDF notations) we have all those rdf:resource attributes, alongside that sometimes-verbose “striped” notation. So schema authors, as they move from XML to RDF, give up their right to decide what XML tag structures to impose on their own users. They let RDF do that, and trust the RDF community to keep coming up with new and better notations that can be shared across all such schemas. Currently we have N3, the evolving XHTML2 RDF/A work and GRDDL as evidence that the RDF community take this responsibility seriously.
What else do schema authors give up? They give up their right to make rules couched in terms of data being missing versus present; defaults, for example. RDF doesn’t impose those kinds of restriction on the creators of RDF documents.
For example, look at SMBmeta from Dan Bricklin (great name!). The SMBmeta spec tells us:
The country is assumed to be “us” if omitted.
For better or worse, you simply couldn’t say that in an RDF schema. To attempt to do so would be to misunderstand RDF entirely. RDF applications, given some other background knowledge, might in certain circumstances be justified in making that conclusion. But you can’t say “anybody who uses my schema to describe a Business, but omits the country code, is implicitly saying that the business is in the USA”.
Why? RDF is designed for open-world, data sharing apps, where content is syndicated, re-syndicated, merged, separated, shredded and glued back together. A single document might draw on a dozen different vocabularies, mixed tightly together at the elements and attributes level. It would be completely impractical for applications to have to read and understand the corresponding schemas, their idiosyncratic defaulting rules, and the interactions between those rules across schemas.
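A tiny Python sketch of why such defaulting rules break down once data gets mixed (the smb: and ex: names are invented; this is not how actual SMBmeta tools behave):

```python
# Why per-schema defaults don't survive data mixing: after a merge, a
# consumer can't tell "omitted, so the default applies" from plain
# "unknown", and a naively applied default can contradict another source.
# The smb:/ex: names are invented; this is not how SMBmeta tools work.
graph_a = {("ex:biz", "smb:name", "Acme")}      # country omitted
graph_b = {("ex:biz", "smb:country", "fr")}     # another source knows it

def country_with_default(graph):
    vals = {o for s, p, o in graph if s == "ex:biz" and p == "smb:country"}
    return vals or {"us"}  # the "assume us if omitted" rule, naively applied

print(country_with_default(graph_a))            # {'us'}: a fabricated claim
print(country_with_default(graph_a | graph_b))  # {'fr'}: the default vanishes
```

An application reading graph_a alone manufactures a claim about the world that no source ever made, and the answer silently changes when more data arrives. That’s exactly the behaviour open-world data sharing can’t tolerate.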
To make something like SMBmeta markup play nicely in the RDF world, we need to do more than simply analyse its XML structures and come up with a corresponding structure of RDF class and property names. We need to look deeper at the assumptions in the data, so that we don’t trample over the intended meaning of the XML when bringing it into the RDF environment. I took a look at making an SMBmeta file into RDF yesterday. More on that another time. The file got a bit uglier — the so-called RDF tax — to which it is tempting to say, “oh, we’ll just use a different RDF syntax”. But there’s a deeper issue, intimately bound up with the idea of document validity and data checking discussed here.
By moving a format to use RDF, we restrict the kinds of things a schema author can demand of his or her user. In RDF, you don’t get to say “if country is missing, assume USA”. This is why I consider GRDDL (the use of XSLT to go from non-RDF XML into RDF/XML) to be only part of the solution to the “Cambridge Communiqué” problem of better connecting RDF to XML. So whenever we try to RDFize some XML format, we have to look very carefully for these kinds of assumptions, since they don’t play well in the RDF data-mixing world. The Atom syndication format is another example I’ve been looking at. It turns out that we can construct a document that meets the syntactic rules of both Atom and RDF/XML, ie. there is a profile of Atom that could be used directly as RDF (see syntax check, which will gripe about namespace prefixes). But before we rush around celebrating, we need to check the Atom spec very carefully, to see if the meaning of Atom terms makes sense in the RDF world (eg. defaulting rules — would RDF applications miss out on extra data? that’s bearable; would they misunderstand Atom data? that’s not).
What other constraints do we take away from XML schema authors? Here’s a big one: attaching meaning to XML element order. RDF graphs, like SQL relational tables, are unordered. This is because they correspond to logical assertions about the world. The importance of this, when trying to understand RDF, simply can’t be overstated.
A couple of quick examples: in RSS 1.0, we wanted the RDF content to include an ordered list of the items described in the feed. Because RDF doesn’t preserve XML element ordering after you’ve parsed the document, we had to find a way of putting that ordering information into the graph. The design we chose was to have an rdf:Seq structure at the top of the document. More recently, this week I’ve been looking into the possibilities for reflecting GML into RDF. GML syntax looks a lot like RDF already, and I was able to make an example into parseable RDF very easily. However, GML often needs to describe ordered lists of points (eg. for polygons on a map). When I looked more closely at the RDFization design, I realised I had created meaningless data, since the RDF statements didn’t preserve the XML element ordering information that GML uses to link a set of points into a line. The markup looked like RDF/XML, but didn’t preserve enough information. There are other designs (eg. using RDF collections, or packing data into microsyntax-formatted strings) that need to be explored.
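The rdf:Seq idea can be sketched like this: ordering is pushed into the data itself as numbered membership properties (rdf:_1, rdf:_2, …), so it survives however the triples are shuffled or merged. (A simplified sketch, not the full RDF container vocabulary.)

```python
# Sketch of the rdf:Seq trick: since a triple set is unordered, ordering is
# reified into the data as numbered membership properties (rdf:_1, rdf:_2,
# ...), so it survives however the triples are shuffled or merged. This is
# a simplification of RDF's actual container vocabulary.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def as_seq(seq_node, items):
    return {(seq_node, RDF + "_" + str(i), item)
            for i, item in enumerate(items, 1)}

def read_seq(seq_node, graph):
    members = [(int(p[len(RDF) + 1:]), o)
               for s, p, o in graph
               if s == seq_node and p.startswith(RDF + "_")]
    return [o for _, o in sorted(members)]

graph = as_seq("ex:items", ["ex:item1", "ex:item2", "ex:item3"])
print(read_seq("ex:items", graph))  # ['ex:item1', 'ex:item2', 'ex:item3']
```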
The exercise of “RDFizing” an XML format consists of more than mapping from a tree-based data structure into an edge-labelled graph one. We also have to go through and find out how the original format deals with things like missing data, defaulting rules, and the use of XML element order to carry meaning. Once in the RDF world, all vocabularies behave the same; we take some flexibility and choice away from schema authors, so that end-user applications can enjoy a different kind of flexibility and choice.
It is important to RDF that element order information can (usually) be thrown away. This design allows us to grab data from multiple sources, and integrate it relatively cheaply. But it is also grounded in the workings of logic and human communication. Consider that we can say in English something like:
There’s a person whose name is “Dan Brickley”, whose birthday is January 9th, and who works for the organisation which has a homepage identified by the URI <http://www.w3.org/> and which has a title “World Wide Web Consortium”.
Instead of that, we could have said something a bit different, but which would be true or false of the world under exactly the same circumstances:
A thing with a title of “World Wide Web Consortium”, identified by the URI <http://www.w3.org/> is a homepage of an organisation that a person born on January 9th called “Dan Brickley” works for.
Machines are really pretty bad at natural language parsing, because they lack commonsense. They can’t understand the vast hidden context behind our utterances. RDF won’t fix that. Ever. It is a design for data sharing in a world where machines have these terrible limitations.
Sometimes it makes sense to adjust our human practices to make things easier on computers, when we want the computers to do things we’d rather not do ourselves. So for example, we simplify our handwriting, so our PDAs can recognise the letters in our handwriting. RDF is a simplification of the practice of making basic claims about the world. The above two English-prose paragraphs are true in exactly the same circumstances, and false in exactly the same circumstances, regardless of the re-ordering of the claims.
W3C went to great lengths to make sure that RDF shares these logical characteristics, since it is fundamental to data-sharing on a global scale (which is what W3C is all about). Let’s see the RDF version written in XML:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <foaf:Person>
    <foaf:name>Dan Brickley</foaf:name>
    <foaf:birthday>01-09</foaf:birthday>
    <foaf:workplaceHomepage>
      <foaf:Document rdf:about="http://www.w3.org/">
        <dc:title>World Wide Web Consortium</dc:title>
      </foaf:Document>
    </foaf:workplaceHomepage>
  </foaf:Person>
</rdf:RDF>
Here's another way we might have written those exact same 6 statements:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns="http://purl.org/dc/elements/1.1/">
  <foaf:Document rdf:about="http://www.w3.org/">
    <title>World Wide Web Consortium</title>
  </foaf:Document>
  <foaf:Person>
    <foaf:birthday>01-09</foaf:birthday>
    <foaf:name>Dan Brickley</foaf:name>
    <foaf:workplaceHomepage rdf:resource="http://www.w3.org/"/>
  </foaf:Person>
</rdf:RDF>
In both cases, RDF sees only the underlying 3-part statements. Six of them:
- _:x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
- _:x <http://xmlns.com/foaf/0.1/name> "Dan Brickley" .
- _:x <http://xmlns.com/foaf/0.1/birthday> "01-09" .
- <http://www.w3.org/> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document> .
- <http://www.w3.org/> <http://purl.org/dc/elements/1.1/title> "World Wide Web Consortium" .
- _:x <http://xmlns.com/foaf/0.1/workplaceHomepage> <http://www.w3.org/> .
The RDF Semantics spec explains the maths behind this far more carefully and cleverly than I could ever manage. But the idea is simple. Just as the two paragraphs of English prose are equivalent, regardless of statement ordering, so are the bits of RDF.
These RDF statements have no more intrinsic, meaningful order than the rows in a relational database, or the files in a directory on your computer. They might, of course, be sorted on various criteria. But the schemas used (in this case, Dublin Core and FOAF), like all RDF schemas, do not impose application-specific meaning on the ordering of the XML elements. To risk over-emphasising a point, neither do the creators of Dublin Core, or of FOAF, get to make up rules for what happens if data is missing from the graph. Data costs time and money to manage, and there are thousands of reasons why data might usefully be missing from an RDF graph. RDF pushes such concerns down into the application layer: if you have an application which wants the birthdays of all employees to be listed, that’s your own business. It is a separate problem from that of defining a shared markup for birthdays, or for representing employment.
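To restate the point in code, here is a sketch (in Python, with the statements as simple 3-tuples) showing that two differently-ordered serialisations of those six statements denote exactly the same graph:

```python
# The same point in code: however the statements are ordered on the wire,
# RDF sees only a set of triples. Two serialisations that differ only in
# ordering denote exactly the same graph.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/elements/1.1/"
W3C = "http://www.w3.org/"

doc_one = [
    ("_:x", RDF + "type", FOAF + "Person"),
    ("_:x", FOAF + "name", "Dan Brickley"),
    ("_:x", FOAF + "birthday", "01-09"),
    (W3C, RDF + "type", FOAF + "Document"),
    (W3C, DC + "title", "World Wide Web Consortium"),
    ("_:x", FOAF + "workplaceHomepage", W3C),
]
doc_two = list(reversed(doc_one))        # a differently-ordered document

print(doc_one == doc_two)                # False as ordered, XML-ish lists
print(set(doc_one) == set(doc_two))      # True as RDF graphs
```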
So let’s look again at that graph of 6 statements, but as a diagram, since that emphasises the unordered nature of the data (follow link for full-size image):
As you can see, the RDF abstraction normalises both of the XML/RDF forms given above into a single structure. This is another of those “love it or hate it” features of RDF. Developers sometimes complain that RDF has too many ways of writing the same thing. Flipped around, this is a feature: RDF provides an account of what seemingly diverse descriptions have in common.
Consider an application that was trying to collect the birthdays of W3C team members, perhaps to output in iCalendar format, or to link to our Amazon wishlists and suchlike. The application doesn’t need to care about lots of things. It doesn’t care if it loads data from RDF/XML, or N3, or XHTML2/RDFA, or from a GRDDL-based XSLT transformation. All it cares about is scooping up some statements about people, their “workplace homepage”, and their names and birthdays. In this simple example, all of that information can be expressed using a single RDF vocabulary, FOAF. But it is essential to realise that the RDF application’s information needs could very easily have drawn upon other RDF vocabularies (eg. it might be looking for birthdays of people who work for W3C Member organizations, and who have contributed to W3C standards). FOAF isn’t enough to express that, but, mixed in with other schemas, much of that information is available in the Semantic Web already.
Forget about validation, RDF checking, and constraints for a while. Think about querying… about how an RDF application would express a request for some chunks of data (whether using one schema, or several). W3C has a new RDF query language for this, called SPARQL. Here’s how we would use SPARQL to ask for W3C team birthdays. If using the 6-statement example, we won’t get many answers. The query would be:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name ?bday
WHERE {
  ?p rdf:type foaf:Person .
  ?p foaf:name ?name .
  ?p foaf:birthday ?bday .
  ?p foaf:workplaceHomepage <http://www.w3.org/> .
}
ORDER BY ?name
The query describes a pattern to be matched against some RDF data, and is answered with a table of variable-to-value bindings, pretty much like SQL. The query looks a lot like an RDF graph, in fact, with “?” marking variables, ie. nodes in the graph where we don’t know the specific content of the node. Our application query here says “find me values for ?bday for anything (we’ll call it “p”) that has a type foaf:Person and a workplaceHomepage matching W3C’s homepage”.
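For the curious, here is a toy Python evaluator for that idea: a query is a graph with variables in some node positions, answered by bindings under which every pattern appears in the data. (This illustrates the matching idea only; it is not real SPARQL.)

```python
# A toy evaluator for the idea above: a query is a graph with variables
# ("?p", "?name", ...) in some node positions; answers are bindings under
# which every query pattern is a triple in the data. Not real SPARQL.
FOAF = "http://xmlns.com/foaf/0.1/"
graph = {
    ("_:x", "rdf:type", FOAF + "Person"),
    ("_:x", FOAF + "name", "Dan Brickley"),
    ("_:x", FOAF + "birthday", "01-09"),
    ("_:x", FOAF + "workplaceHomepage", "http://www.w3.org/"),
}
query = [
    ("?p", "rdf:type", FOAF + "Person"),
    ("?p", FOAF + "name", "?name"),
    ("?p", FOAF + "birthday", "?bday"),
    ("?p", FOAF + "workplaceHomepage", "http://www.w3.org/"),
]

def solve(patterns, graph, binding=None):
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    head, rest = patterns[0], patterns[1:]
    for triple in graph:
        b = dict(binding)
        for term, value in zip(head, triple):
            if term.startswith("?"):
                if b.setdefault(term, value) != value:
                    break               # variable already bound differently
            elif term != value:
                break                   # constant doesn't match
        else:
            yield from solve(rest, graph, b)

for row in solve(query, graph):
    print(row["?name"], row["?bday"])   # Dan Brickley 01-09
```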
I don’t intend to give a SPARQL tutorial here. The official spec is the best place to start learning (and please do send review comments!); there are also some (slightly out of date) tutorial materials online. You might also look at the RDFAuthor tutorial, since RDFAuthor serves as an authoring tool both for RDF itself, and for RDF queries (using Squish, one of the languages that fed into the SPARQL design). That tutorial makes one point quite clearly: RDF queries and RDF graphs are pretty similar structures. If you delve into the details of SPARQL, you’ll find some points at which it departs from the simple-minded world of RDF triples. It lets you ask queries, for example, where you say things about the source of the graph, as well as queries in which certain properties are marked as optional (eg. we might ask for photographic depictions of the people in the query, yet not want the query to fail if those bits of data weren’t available in the graph). Both of these features are useful and were expected when the SPARQL work began, but they move us away from the simple narrative of “RDF queries are just like RDF graphs with bits of graph marked as missing”. As a quick intro to RDF querying with SPARQL, that concept is worth hanging onto. And it also helps us think about the relationship between RDF query and RDF “validation” or data checking.
I mentioned the Schematron system earlier. Schematron is useful when thinking about how new kinds of RDF checking and validation might work. It is built on top of the XPath spec. By testing to see whether specific XPath addresses match against some target document, we can probe the contents to see whether they meet our application needs or not. The tagline on the Schematron site says it all: Schematron is a language for making assertions about patterns found in XML documents. A few words from the overview page, which describes how you can develop and mix two kinds of schemas:
- Report elements allow you to diagnose which variant of a language you are dealing with.
- Assert elements allow you to confirm that the document conforms to a particular schema.
The approach is pretty simple:
- First, find context nodes in the document (typically elements) based on XPath criteria;
- Then, check to see if some other XPath expressions are true, for each of those nodes.
That’s it, really. Schematron schemas allow you to check XML documents against rules that go beyond those that come with the elements and attributes used by the document. You might, for example, be combining various namespaces, and want to have them combined in a certain markup pattern. Or you might want to apply different data-integrity checks at different points in a workflow.
This idea is powerful, and simple. I’ve long wanted the same approach, but defined over RDF. Some time ago, Libby Miller prototyped it as Schemarama, using the Squish RDF query language. XPath doesn’t make much sense for a Schematron-for-RDF, since we want something that can be used against RDF graphs, rather than XML documents. Now that the SPARQL language is more or less finished, I think it’s time to revisit the approach, but using SPARQL instead. SPARQL’s OPTIONAL mechanism, I think, makes it much more practical. There are also possibilities for using OWL’s data structures in a similar way; see Damian Steer’s XTech paper, Using OWL for Forms, Validation, and Application Profiles for RDF. As he says,
One of the most common issues encountered by RDF developers is the need for some form of constraint on their data, and particularly validation. Unfortunately the RDF schema languages are (for perfectly good reasons) unsuited for this purpose. For example, property ranges are commonly misunderstood by newcomers to RDF as restricting possible values.
I’m not yet sure which path will be most fruitful. OWL isn’t really meant to express such application-oriented constraints, but the data structures look temptingly useful. I do lean towards layering a next-generation RDF checker on top of SPARQL facilities, for one reason. Queries capture application usage practice. They tell us what the application wants. The Semantic Web requires something of a cultural shift here, away from the idea that “validation” is solely the act of comparing some instance data against the schemas it uses. Those schemas (even if they use OWL ontology extensions) simply don’t provide information to do enough useful checking. Instead of this, we need to explore ways of sharing machine-readable characterisations of common RDF graph structures. In Schematron’s terms, we would build data-checking applications on top of a language for making assertions about patterns found in RDF graphs. This could all be layered on top of SPARQL, or on OWL, or on some new rule language. I don’t care how it is done, so long as it happens! When RDF is projected back into an XML environment, Schematron itself could even be used…
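To make the Schemarama-ish idea concrete, here is a small Python sketch of Schematron’s two steps transposed to a triple set: first find the context nodes (everything typed foaf:Person), then assert that each one carries the data this particular application needs (a foaf:name). The graph and the rule are invented for illustration.

```python
# A Schemarama-flavoured sketch: Schematron's two steps transposed to an
# RDF graph. First find the context nodes (here, everything typed as a
# foaf:Person), then assert that each one carries the data this particular
# application needs (a foaf:name). Graph and rule invented for illustration.
FOAF = "http://xmlns.com/foaf/0.1/"
graph = {
    ("_:a", "rdf:type", FOAF + "Person"),
    ("_:a", FOAF + "name", "Libby Miller"),
    ("_:b", "rdf:type", FOAF + "Person"),   # no name: fails the assertion
}

def check(graph):
    people = {s for s, p, o in graph
              if p == "rdf:type" and o == FOAF + "Person"}
    named = {s for s, p, o in graph if p == FOAF + "name"}
    return [node + ": missing foaf:name" for node in sorted(people - named)]

print(check(graph))  # ['_:b: missing foaf:name']
```

Note that the verdict is application-relative: the same graph might pass a different application’s rules without complaint.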
The absence of this kind of data-checking is hurting RDF deployment, because it discourages RDF vocabulary designers from re-using other schemas in the instance data that their applications produce and consume.
Revisiting the example application considered above, imagine we have built some applications based around the idea of finding people’s birthdays, names, and photos. We might use FOAF to describe the people info, Dublin Core to describe the image, Creative Commons to describe the usage rights for the image, and so forth. While W3C has a language for describing the meaning of the individual terms (ie. classes and properties) we’d use in our descriptions, it doesn’t yet have a standard way of capturing the “descriptive recipe” deployed in the application. We don’t have a non-prose (and hence language-neutral) way of saying how those pieces of RDF vocabulary are combined into a larger, and more useful, data structure. We don’t have a way of checking instance data against it, so that a “birthday photo app” validator could probe the RDF graph and say, “that file is valid for this purpose”.
The cultural shift we need, and the toolset to accompany it, is a shift to application-oriented validation. Instead of an absolute, universal “yes” or “no” for some RDF document, we need a more nuanced approach. An RDF document might be well-suited for use in a photo metadata application, but missing some data that is needed for an addressbook. The existence of a common framework for expressing such information needs would go a long way to addressing the “rummaging in spaghetti” feeling that developers have when they work with RDF, as well as the desire that application authors have for expressing XML-esque constraints (eg. the FOAFnet subset profile of FOAF).
The approach could be as simple as having online catalogues of “common queries”, so that datasources and applications could share a way of talking about the patterns of RDF that they produce and consume. Or it could be a lot fancier; there’s a lot to explore, and a lot to gain. SPARQL is probably the place to start exploring…
I’ve finally updated and customised WordPress, tweaked the links to use RSS1, and discovered that I can get category-specific feeds, eg. technology.
Planet RDF is now taking that category feed (thanks Dave!), which allows me to vent freely on other things without worrying too much about cluttering up a predominantly tech-oriented site. That said, I find the glimpses into people’s non-tech lives to be part of the charm of the site. So I’ll probably include the odd conspiracy theory or other semi-random thing in the Technology section from time to time, since all topics ultimately overlap…
As a contrast to the GML/KML and Google-related posts, here is an annotated Yahoo! map, derived from geo-extended RSS 2.0 markup. I tried feeding the service a variant of RSS 1.0 last week (albeit with the Yahoo! extensions implicitly in the RSS namespace) and it seemed to work. They don’t yet have worldwide coverage, unfortunately. [via flickr thread]
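For reference, the geo-extended markup involved is tiny; an RSS 2.0 item carrying the W3C lat/long properties looks roughly like this (URL and coordinates are made up for illustration):

```xml
<item xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <title>Photo from the harbourside</title>
  <link>http://example.org/photos/42</link>
  <!-- point location in WGS_84, per the W3C SemWeb IG vocabulary -->
  <geo:lat>51.4498</geo:lat>
  <geo:long>-2.5983</geo:long>
</item>
```

(In practice the geo namespace would usually be declared once, on the feed’s root element, rather than per-item.)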
I’ve been trying to get lat/long GPS data embedded in my photos, before I upload them to Flickr, so that geobloggers.com will make use of the data. So far, I can only get that site to use explicit “geo:lat=123.345”-style flickr tagging; embedded EXIF seems to be ignored. See ongoing discussion in the Flickr GeoTagging group.
I spent some time yesterday talking with Ron Lake about GML, RDF, RSS and other acronyms. GML was originally an RDF application, and various RDFisms can still be seen in the design. I learned a fair bit about GML, and about its extensibility and profiling mechanisms.
We discussed some possibilities for sharing data between GML, RSS/Atom and RDF environments. In particular, two options: RDF inside GML; and RDFized GML.
The possibility of embedding islands of RDF inside GML (eg. the GML for a restaurant might use RDF for restaurant-review or menu markup) is interesting, as it would allow GML documents to use any RDF vocabulary to describe the features on a map. Currently, such extension data typically requires the creation of a custom XML Schema. The other option, “RDFized GML”, is to explore the creation of an RDF vocabulary that allows some useful subset of GML data to be used in RDF. I’ll come back to this in a minute.
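A sketch of the first option, RDF inside GML. This is purely illustrative: the GML structure here is heavily simplified, the embedding mechanism is exactly the thing that would need designing, and the review vocabulary is just one example of an arbitrary RDF vocabulary turning up inside a feature description:

```xml
<gml:featureMember xmlns:gml="http://www.opengis.net/gml">
  <Restaurant gml:id="r1">
    <gml:name>Luigi's</gml:name>
    <!-- hypothetical RDF island carrying review data -->
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:rev="http://purl.org/stuff/rev#">
      <rdf:Description rdf:about="#r1">
        <rev:rating>4</rev:rating>
      </rdf:Description>
    </rdf:RDF>
  </Restaurant>
</gml:featureMember>
```

The attraction is that the outer GML stays schema-valid GIS data, while the island can use any published RDF vocabulary without anyone having to write a custom XML Schema for it.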
While GML comes from the world of professional GIS, its influence is being felt more widely: Google Earth (formerly Keyhole) uses something called KML, which bears a great many similarities to GML. Meanwhile in the RDF and RSS/Atom world, the very basic addition of “geo:lat” and “geo:long” tagging (sometimes using the W3C SemWeb IG WGS_84 namespace) has got a number of toolmakers interested. This year has seen the release of Yahoo! Maps, Google Maps, Google Earth and most recently Microsoft Virtual Earth. We’ve also seen the release of the excellent Mapping Hacks book, and increasing interest in this area from Web developers.
Although the experimental SWIG RDF vocabulary only deals with points described in WGS_84, there have been various discussions on possible extensions (eg. RDFGeom-2d from Chris Goad). These are intriguing, but we should be careful to avoid re-inventing wheels. Basically, I think we have all the ingredients for a hybrid approach: an RDFized GML subset designed for use by Web developers alongside RSS/Atom, FOAF and other public-facing XML formats. GML serves well as a data format in the GIS community, but some work is needed to identify a subset that can gain adoption on the wider Web.
The tiny W3C SWIG vocab, and the related geo:lat/long tagging of “geo” RSS feeds, have shown that there is real interest in a lightweight XML-based mechanism for sharing map-related markup. GML shows us (via a 600 page specification, for GML 3.1) quite how rich and complex a problem space we’re facing, and KML demonstrates that a medium-sized “GML lite” subset can get traction with webmasters and developers, when backed by useful tools and services.
There are two pieces of work to do here (setting aside for now the topic of RDF islands within GML documents). Let’s first find a strawman profile of GML. From my limited knowledge and discussion with others, something “GML 2-ish” but profiled against GML 3.1, is the area to explore. Then we try getting those data structures into RDF, so it can mix freely with other information.
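As a strawman for that second step, here is what “RDFized GML” instance data might look like in N3/Turtle. The gml: namespace URI and property names are pure invention for illustration; only the geo: vocabulary is the real SWIG one:

```turtle
@prefix gml:  <http://example.org/ns/gml-profile#> .   # hypothetical RDF mapping of a GML subset
@prefix geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<#cityhall>
    a gml:Building ;
    rdfs:label "City Hall" ;
    gml:position
        [ a gml:Point ;
          geo:lat  "51.4519" ;
          geo:long "-2.5976" ] .
</gml>
```

Data like this could then mix freely in the same graph with FOAF, Dublin Core, RSS 1.0 and friends, which is the whole point of the exercise.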
I understand from Ron Lake that profiling is something that is actively encouraged for GML, and there are even tools to support it that come with the spec: have a look at subsetutility.zip. These files (thanks Ron!) show a pretty easy path for experimentation with profiles. In addition to the schema subsetting utilities, the .zip also includes (just as an example to help me understand GML) an example application schema CommonObjects.xsd, showing how to define things like ‘Building’, ‘River’, and a sample instance .xml file that uses it.
To use the profiling tool, just put the unzipped files directly in the base/ directory of .xsd files that ships with GML 3.1, then run an XSLT processor to generate a GML subset.
xsltproc depends.xsl gml.xsd > _gml.dep
xsltproc GML3.1.1Subset.xsl _gml.dep > _gmlSubset.xsd
…and that’s your profile. The scripts take care of all the dependencies (ie. they’ll read the 29 XML Schemas, so you don’t have to :)
The bits of GML you want are specified as parameters in GML3.1.1Subset.xsl. The default in this .zip is: gml:Point, gml:LineString, gml:Polygon, gml:LinearRing, gml:Observation, gml:TimeInstant, gml:TimePeriod
I’m no GML expert, but if someone can help get some instance data matching such a profile, I’ll have a go at RDFizing it. Also, of course, it will be useful to debate how many facilities from full GML would find use in the Webmaster (RSS, KML etc) scene.
Disclaimer: for now this is purely an informal collaboration. If we make something interesting, it might be worth investigation of something more formal between W3C (home of RDF, and where I work) and OGC (home of GML). For now, let’s just try out some ideas…
Another screengrab, this time showing photos overlaid using the geobloggers KML feed, derived from Flickr tagged images.
RDF versus GML, Chris Goad (Sept 2004):
GML is the XML language for geography developed by the Open GIS consortium. The third major revision of this specification, known as GML3, was released in January of 2003. RDFMap, when used in conjunction with RDFGeom, constitutes an attempt to develop an alternative approach based on RDF to expressing geographical information. This note outlines the relationship between this approach and GML, and suggests techniques for converting data between the two formalisms. The important differences between RDFMap and GML derive from the choice of RDF, and would apply equally to other RDF-based formalisms for geography.
Interesting comparison. I’m somewhat wary of over-hyping RDF’s offering here. There’s certainly value in looking how GML application schemas might be reflected into RDF vocabularies (and vice-versa). Chris’s article ends with an example taken from the GML spec, represented in RDF. I wonder about going back the other way too, showing RDF properties used inside a GML document.
This is the hidden gem of Google Earth. Adding a “Network Link” allows you to fetch KML data from remote servers. It does this in two ways, Time Based or Location Based. So *anyone* can add dynamic data to Google Maps.
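For the curious, a Network Link is just a small KML fragment pointing at a remote KML source. It looks something along these lines (written from memory of current KML, so element names may be slightly off between versions, and the URL is invented):

```xml
<NetworkLink>
  <name>Geotagged Flickr photos</name>
  <Url>
    <href>http://example.org/geophotos.kml</href>
    <!-- time-based refresh: re-fetch every 60 seconds -->
    <refreshInterval>60</refreshInterval>
    <!-- location-based refresh: re-fetch when the camera stops moving -->
    <viewRefreshMode>onStop</viewRefreshMode>
  </Url>
</NetworkLink>
```

Google Earth re-fetches the URL on a timer or whenever the view settles, so the server can return KML tailored to the currently visible area.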
Apparently KML is based on GML. I don’t know how Keyhole/Google’s work differs from it. There seems to be a role here for something simple enough for Google Earth, WorldWind, and other viewer apps to use, when consulting a remote server for info about some area of interest. Maybe it’s GML Web Feature Servers, maybe KML, maybe geo-extended RSS/Atom, or perhaps generic query interfaces like the SPARQL protocol. SOAP, WSDL and REST fit in the picture somewhere. Probably, various things will be used in different environments, depending on application emphasis. We might be looking up the opening-hours of a shop, contact information for an organization, or jobs, events, photos, blog posts, FOAF profiles etc in a certain area, … it isn’t clear where the line is drawn between ‘geographic’ data and the wider unbounded collection of information about the world. GML has strengths at the geographical end of the spectrum, RDF (and its query system, SPARQL) has strengths at the generic, domain-neutral end. RSS/Atom is serving well as a generic carrier for data syndication. It isn’t clear to me yet where KML fits (or SVG, for that matter), but work on the relationship between GML and RDF would seem timely.
The geobloggers post has examples and links to flickr and del.icio.us-based services that expose this interface. I’m going to try making such a service on top of SPARQL…