Obama for middle-managers

(inspired by the ‘Yes we can’ PowerPoint slides…)

We Will

  • act – not only to create new jobs, but to lay a new foundation for growth
  • build the roads and bridges, the electric grids and digital lines that feed our commerce and bind us together
  • restore science to its rightful place, and wield technology’s wonders to raise health care’s quality and lower its cost
  • harness the sun and the winds and the soil to fuel our cars and run our factories
  • transform our schools and colleges and universities to meet the demands of a new age
  • begin to responsibly leave Iraq to its people
  • work tirelessly to lessen the nuclear threat, and roll back the spectre of a warming planet
  • [To those who cling to power through corruption and deceit and the silencing of dissent], extend a hand if you are willing to unclench your fist
  • [for those who seek to advance their aims by inducing terror and slaughtering innocents], defeat you

We Will Not…

  • give [those ideals] up for expedience’s sake
  • apologise for our way of life, nor will we waver in its defence

Family trees, Gedcom::FOAF in CPAN, and provenance

Ever wondered who the mother(s) of Adam and Eve’s grand-children were? Me too. But don’t expect SPARQL or the Semantic Web to answer that one! Meanwhile, …

You might nevertheless care to try the Gedcom::FOAF CPAN module from Brian Cassidy. It can read Gedcom, a popular ‘family history’ file format, and turn it into RDF (using FOAF and the relationship and biography vocabularies). A handy tool that can open up a lot of data to SPARQL querying.

The Gedcom::FOAF API seems to focus on turning the person or family Gedcom entries into their own FOAF XML files. I wrote a quick and horrid Perl script that runs over a Gedcom file and emits a single flattened RDF/XML document. It generates URIs for XML files that don’t actually exist, but that isn’t a huge problem.

Perhaps someone would care to take a look at this code and see whether a more RDFa- and linked-data-friendly script would be useful?

Usage: perl gedcom2foafdump.pl BUELL001.GED > _sample_gedfoaf.rdf
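
The core of the script looks roughly like the sketch below. This is a hedged reconstruction rather than the actual gedcom2foafdump.pl: it assumes the as_foaf_xml method that the Gedcom::FOAF docs describe for individuals, and crudely strips each per-person document’s wrapper so everything lands in one rdf:RDF element.

#!/usr/bin/perl
# Hedged sketch of a gedcom2foafdump.pl-style flattener (not the original script).
use strict;
use warnings;
use Gedcom;
use Gedcom::FOAF;    # adds as_foaf_xml to individuals and families

my $file = shift @ARGV or die "Usage: $0 FILE.GED > out.rdf\n";
my $ged  = Gedcom->new( gedcom_file => $file, read_only => 1 );

print qq{<?xml version="1.0" encoding="UTF-8"?>\n};
print qq{<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n};
print qq{         xmlns:foaf="http://xmlns.com/foaf/0.1/"\n};
print qq{         xmlns:rel="http://purl.org/vocab/relationship/"\n};
print qq{         xmlns:bio="http://purl.org/vocab/bio/0.1/">\n};

for my $person ( $ged->individuals ) {
    my $xml = $person->as_foaf_xml;

    # Crude flattening: drop each fragment's XML prolog and rdf:RDF
    # wrapper so the fragments can coexist in one flat graph.
    $xml =~ s/^.*?<rdf:RDF[^>]*>//s;
    $xml =~ s/<\/rdf:RDF>\s*$//s;
    print $xml;
}

print qq{</rdf:RDF>\n};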

The sample data I tested it on is intriguing, though I’ve not really looked around it yet.

It contains over 9800 people including the complete royal lines of England, France, Spain and the partial royal lines of almost all other European countries. It also includes 19 United States Presidents descended from royalty, including Washington, both Roosevelts, Bush, Jefferson, Nixon and others. It also has such famous people as Brigham Young, William Bradford, Napoleon Bonaparte, Winston Churchill, Anne Bradstreet (Dudley), Jesus Christ, Daniel Boone, King Arthur, Jefferson Davis, Brian Boru King of Ireland, and others. It goes all the way back to Adam and Eve and also includes lines to ancient Rome including Constantine the Great and ancient Egypt including King Tutankhamen (Tut).

The data is credited to Matt & Ellie Buell, “Uploaded By: Eochaid”, 1995-05-25.

Here’s an extract to give an idea of the Gedcom form:

0 @I4961@ INDI
1 NAME Adam //
1 SEX M
1 REFN +
1 BIRT
2 DATE ABT 4000 BC
2 PLAC Eden
1 DEAT
2 DATE ABT 3070 BC
1 FAMS @F2398@
1 NOTE He was the first human on Earth.
1 SOUR Genesis 2:20 KJV
0 @I4962@ INDI
1 NAME Eve //
1 SEX F
1 REFN +
1 BIRT
2 DATE ABT 4000 BC
2 PLAC Eden
1 FAMS @F2398@
1 SOUR Genesis 3:20 KJV

It might not directly answer the great questions of biblical scholarship, but it could be a fun dataset to explore Gedcom / RDF mappings with. I wonder how it compares with Freebase, DBpedia etc.

The Perl module is a good start for experimentation but it only really scratches the surface of the problem of representing source/provenance and uncertainty. On which topic, Jeni Tennison has a post from a year ago that’s well worth (re-)reading.

What I’ve done in the little Perl script above is implement a simplification: instead of each family description living in its own separate XML file, everything is squashed into one big flat set of triples (a ‘graph’). This may or may not be appropriate, depending on the sourcing of the records. It seems Gedcom offers some basic notion of ‘source’, although not one expressed in terms of URIs. If I look at the SOUR(ce) fields in the Gedcom file, I see information like this (which currently seems to be ignored in the Gedcom::FOAF mapping):

grep SOUR BUELL001.GED | sort | uniq

1 NOTE !SOURCE:Burford Genealogy, Page 102 Cause of Death; Hemorrage of brain
1 NOTE !SOURCE:Gertrude Miller letter “Harvey Lee lived almost 1 year. He weighed
1 NOTE !SOURCE:Gertrude Miller letter “Lynn died of a ruptured appendix.”
1 NOTE !SOURCE:Gertrude Miller letter “Vivian died of a tubal pregnancy.”
1 SOUR “Castles” Game Manuel by Interplay Productions
1 SOUR “Mayflower Descendants and Their Marriages” pub in 1922 by Bureau of
1 SOUR “Prominent Families of North Jutland” Pub. in Logstor, Denmark. About 1950
1 SOUR /*- TUT
1 SOUR 273
1 SOUR AHamlin777.  E-Mail “Descendents of some guy
1 SOUR Blundell, Sherrie Lea (Slingerland).  information provided on 16 Apr 1995
1 SOUR Blundell, William, Rev. Interview on Jan 29, 1995.
1 SOUR Bogert, Theodore. AOL user “TedLBJ” File uploaded to American Online
1 SOUR Buell, Barbara Jo (Slingerland)
1 SOUR Buell, Beverly Anne (Wenge)
1 SOUR Buell, Beverly Anne (Wenge).  letter addressed to Kim & Barb Buell dated
1 SOUR Buell, Kimberly James.
1 SOUR Buell, Matthew James. written December 19, 1994.
1 SOUR Burnham, Crystal (Harris).  Leter sent to Matt J. Buell on Mar 18, 1995.
1 SOUR Burnham, Crystal Colleen (Harris).  AOL user CBURN1127.  E-mail “Re: [...etc.]

Some of these sources could be tied to cleaner IDs (eg. for books c/o Open Library, although see ‘in search of cultural identifiers’ from Michael Smethurst).

I believe RDF’s SPARQL language gives us a useful tool (the notion of ‘GRAPH’) that can be applied here, but we’re a long way from having worked out the details when it comes to attaching evidence to claims. So for now, we in the RDF scene have a fairly coarse-grained approach to data provenance. Databases are organized into batches of triples, ie. RDF statements that claim something about the world. And while we can use these batches – aka graphs – in our queries, we haven’t really figured out what kind of information we want to associate with them yet. Which is a pity, since this could have uses well beyond family history, for example in online journalistic practice and blog-mediated fact-checking.
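
To make the GRAPH idea concrete, here’s a sketch of the kind of query I mean – the dataset layout and the dc:source link on the graph are hypothetical, purely for illustration:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc:   <http://purl.org/dc/elements/1.1/>

# Which named graph (ie. batch of claims) does each person description
# come from, and what does that batch say about its own source?
SELECT ?person ?name ?g ?source
WHERE {
  GRAPH ?g {
    ?person a foaf:Person ;
            foaf:name ?name .
  }
  OPTIONAL { ?g dc:source ?source . }
}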

Nearby in the Web: see also the SIOC/SWAN telecons, a collaboration in the W3C SemWeb life sciences community around the topic of modelling scientific discourse.

FRBR and W3C Media Annotations

Just spotted this review of FRBR (Functional Requirements for Bibliographic Records) in the group’s wiki, along with some interesting notes on modelling. It seems the Media Annotations work is starting out well, both in terms of the analysis the group is performing and the relationships it is seeing (FRBR is from the library world and often passed over by industry metadata efforts). Things also look healthy in that they’re working on a public list, with notes kept in a public wiki. This is a much healthier W3C than the one I first encountered in ’97, when everything was somewhat hidden away. Unfortunately, attitudes linger – it is still common to see W3C critiqued for being a closed, secretive body, when the truth is that many WGs operate in a very public, collaborative manner.

Cross-browsing and RDF

While cross-searching has been described and demonstrated through this paper and associated work, the problem of cross-browsing a selection of subject gateways has not been addressed. Many gateway users prefer to browse, rather than search. Though browsing usually takes longer than searching, it can be more thorough, as it is not dependent on the user’s terms matching keywords in resource descriptions (even when a thesaurus is used, it is possible for resources to be “missed” if they are not described in great detail).

As a “quick fix”, a group of gateways may create a higher level menu that points to the various browsable menus amongst the gateways. However, this would not be a truly hierarchical menu system, as some gateways maintain browsable resource menus in the same atomic (or lowest level) subject area. One method of enabling cross-browsing is by the use of RDF.

The World Wide Web Consortium has recently published a preliminary draft specification for the Resource Description Framework (RDF). RDF is intended to provide a common framework for the exchange of machine-understandable information on the Web. The specification provides an abstract model for representing arbitrarily complex statements about networked resources, as well as a concrete XML-based syntax for representing these statements in textual form. RDF relies heavily on the notion of standard vocabularies, and work is in progress on a ‘schema’ mechanism that will allow user communities to express their own vocabularies and classification schemes within the RDF model.

RDF’s main contribution may be in the area of cross-browsing rather than cross-searching, which is the focus of the CIP. RDF promises to deliver a much-needed standard mechanism that will support cross-service browsing of highly-organised resources. There are many networked services available which have classified their resources using formal systems like MeSH or UDC. If these services were to each make an RDF description of their collection available, it would be possible to build hierarchical ‘views’ of the distributed services offering a user interface organised by subject-classification rather than by physical location of the resource.

From Cross-Searching Subject Gateways, The Query Routing and Forward Knowledge Approach, Kirriemuir et al., D-Lib Magazine, January 1998.

I wrote this over 11 (eleven) years ago, as something of an aside during a larger paper on metadata for distributed search. While we are making progress towards such goals, especially with regard to cross-referenced descriptions of identifiable things (ie. the advances made lately through linked data techniques), the pace of progress can be quite frustrating. Just as it seems we’re making progress, things take a step backwards. For example, the wonderful lcsh.info site is currently offline while the relevant teams at the Library of Congress figure out how best to proceed. It’s also ten years since Charlotte Jenkins published some great work on auto-classification that used OCLC’s Dewey Decimal Classification. That work also ran into problems, since DDC wasn’t freely available for use in such applications. In the current climate, with Creative Commons, open source, Web 2.0 and suchlike all the rage, I hope we’ll finally see more thesauri and classification systems opened up (eg. with SKOS) and fully linked into the Web. Maybe by 2019 the Web really will be properly cross-referenced…

OpenID – a clash of expectations?

Via Dan Connolly, this from the mod_auth_openid FAQ:

Q: Is it possible to limit login to some users, like htaccess/htpasswd does?

A: No. It is possible to limit authentication to certain identity providers (by using AuthOpenIDDistrusted and AuthOpenIDTrusted, see the main page for more info). If you want to restrict to specific users that span multiple identity providers, then OpenID probably isn’t the authentication method you want. Note that you can always do whatever vetting you want using the REMOTE_USER CGI environment variable after a user authenticates.

Funny, this is just what I thought was most interesting about OpenID: it lets you build sites where you can offer varying experiences (including letting them in or not) to different users based on what you know about them. OpenID itself doesn’t do everything out of the box, but by associating public URIs with people, it’s a very useful step.
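
To illustrate the FAQ’s last point, here’s a minimal sketch of the REMOTE_USER vetting it alludes to, as a Perl CGI script: it assumes mod_auth_openid has already authenticated the visitor and left their OpenID URL in REMOTE_USER (the allow-list entries are hypothetical).

#!/usr/bin/perl
# Hedged sketch: per-user vetting behind mod_auth_openid. Apache has
# already verified the OpenID; we just check it against our own list.
use strict;
use warnings;

my %allowed = map { $_ => 1 } (
    'http://danbri.org/',                 # hypothetical allow-list entries
    'http://example.org/users/alice/',
);

my $openid = $ENV{REMOTE_USER} || '';

if ( $allowed{$openid} ) {
    print "Content-Type: text/html\n\n";
    print "<p>Welcome, $openid</p>\n";
}
else {
    print "Status: 403 Forbidden\nContent-Type: text/html\n\n";
    print "<p>Sorry, this area is members-only.</p>\n";
}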

A year ago I sketched a scenario in this vein (and it seems to have survived a sanity check from Simon Willison, or at least he quotes it). It seems perhaps that OpenID is all things to all people…?

SKOS deployment stats from Sindice

This cropped up in yesterday’s W3C Semantic Web Coordination Group telecon, as we discussed the various measures of SKOS deployment success.

I suggested drawing a distinction between the use of SKOS to publish thesauri (ie. SKOS schemes), and the use of SKOS in RDFS/OWL schemas, for example subclassing skos:Concept or defining properties whose range or domain is skos:Concept. A full treatment would look for a variety of constructs (eg. new properties that declare themselves subPropertyOf something in SKOS).

An example of such a use of SKOS is the new sioc:Category class, recently added to the SIOC namespace.
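
In Turtle, that sort of schema-level SKOS usage looks something like this (a hedged sketch – I haven’t double-checked the exact SIOC declaration):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix sioc: <http://rdfs.org/sioc/ns#> .

# A class whose instances are themselves SKOS concepts; roughly the
# pattern sioc:Category follows (exact declaration unverified here).
sioc:Category rdfs:subClassOf skos:Concept .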

Here are some quick experiments with Sindice.

Search results for advanced “* <http://www.w3.org/2000/01/rdf-schema#domain> <http://www.w3.org/2004/02/skos/core#Concept>”, found 10

Search results for advanced “* <http://www.w3.org/2000/01/rdf-schema#range> <http://www.w3.org/2004/02/skos/core#Concept>”, found 10

Search results for advanced “* <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://www.w3.org/2004/02/skos/core#Concept>”, found 18

Here’s a query that finds all mentions of skos:Concept in an object role within an RDF statement:

Search results for advanced “* * <http://www.w3.org/2004/02/skos/core#Concept>”, found about 432.32 thousand

This all seems quite healthy, although I’ve not clicked through to explore many of these results yet.
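
For anyone with the same data loaded into a local store, the equivalent patterns are easy to write in SPARQL. A sketch (nothing Sindice-specific here; note that aggregate counts like those Sindice reports are a vendor extension in many current stores rather than core SPARQL):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# Properties whose domain or range is skos:Concept...
SELECT DISTINCT ?p WHERE {
  { ?p rdfs:domain skos:Concept . } UNION { ?p rdfs:range skos:Concept . }
}

# ...and, as a separate query, classes declared as subclasses of it:
# SELECT DISTINCT ?c WHERE { ?c rdfs:subClassOf skos:Concept . }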

BTW, I also tried the proposed (but retracted – see the CR request notes) new SKOS namespace, ie. http://www.w3.org/2008/05/skos#Concept (unless I’m mistaken). I couldn’t find any data in Sindice yet that was using this namespace.