Family trees, Gedcom::FOAF in CPAN, and provenance

Every wondered who the mother(s) of Adam and Eve’s grand-children were? Me too. But don’t expect SPARQL or the Semantic Web to answer that one! Meanwhile, …

You might nevetheless care to try the Gedcom::FOAF CPAN module from Brian Cassidy. It can read Gedcom, a popular ‘family history’ file format, and turn it into RDF (using FOAF and the relationship and biography vocabularies). A handy tool that can open up a lot of data to SPARQL querying.

The Gedcom::FOAF API seems to focus on turning the people or family Gedcom entries  into their own FOAF XML files. I wrote a quick and horrid Perl script that runs over a Gedcom file and emits a single flattened RDF/XML document. While URIs for non-existent XML files are generated, this isn’t a huge problem.

Perhaps someone would care to take a look at this code and see whether a more RDFa and linked-data script would be useful?

Usage: perl gedcom2foafdump.pl BUELL001.GED > _sample_gedfoaf.rdf

The sample data I tested it on is intriguing, though I’ve not really looked around it yet.

It contains over 9800 people including the complete royal lines of England, France, Spain and the partial royal lines of almost all other European countries. It also includes 19 United States Presidents descended from royalty, including Washington, both Roosevelts, Bush, Jefferson, Nixon and others. It also has such famous people as Brigham Young, William Bradford, Napoleon Bonaparte, Winston Churchill, Anne Bradstreet (Dudley), Jesus Christ, Daniel Boone, King Arthur, Jefferson Davis, Brian Boru King of Ireland, and others. It goes all the way back to Adam and Eve and also includes lines to ancient Rome including Constantine the Great and ancient Egypt including King Tutankhamen (Tut).

The data is credited to Matt & Ellie Buell, “Uploaded By: Eochaid”, 1995-05-25.

Here’s an extract to give an idea of the Gedcom form:

0 @I4961@ INDI
1 NAME Adam //
1 SEX M
1 REFN +
1 BIRT
2 DATE ABT 4000 BC
2 PLAC Eden
1 DEAT
2 DATE ABT 3070 BC
1 FAMS @F2398@
1 NOTE He was the first human on Earth.
1 SOUR Genesis 2:20 KJV
0 @I4962@ INDI
1 NAME Eve //
1 SEX F
1 REFN +
1 BIRT
2 DATE ABT 4000 BC
2 PLAC Eden
1 FAMS @F2398@
1 SOUR Genesis 3:20 KJV

It might not directly answer the great questions of biblical scholarship, but it could be a fun dataset to explore Gedcom / RDF mappings with. I wonder how it compares with Freebase, DBpedia etc.

The Perl module is a good start for experimentation but it only really scratches the surface of the problem of representing source/provenance and uncertainty. On which topic, Jeni Tennison has a post from a year ago that’s well worth (re-)reading.

What I’ve done in the above little Perl script is implement a simplification: instead of each family description being its own separate XML file, they are all squashed into a big flat set of triples (‘graph’). This may or may not be appropriate, depending on the sourcing of the records. It seems Gedcom offers some basic notion of ‘source’, although not one expressed in terms of URIs. If I look in the SOUR(ce) field in the Gedcom file, I see information like this (which currently seems to be ignored in the Gedcom::FOAF mapping):

grep SOUR BUELL001.GED | sort | uniq

1 NOTE !SOURCE:Burford Genealogy, Page 102 Cause of Death; Hemorrage of brain
1 NOTE !SOURCE:Gertrude Miller letter “Harvey Lee lived almost 1 year. He weighed
1 NOTE !SOURCE:Gertrude Miller letter “Lynn died of a ruptured appendix.”
1 NOTE !SOURCE:Gertrude Miller letter “Vivian died of a tubal pregnancy.”
1 SOUR “Castles” Game Manuel by Interplay Productions
1 SOUR “Mayflower Descendants and Their Marriages” pub in 1922 by Bureau of
1 SOUR “Prominent Families of North Jutland” Pub. in Logstor, Denmark. About 1950
1 SOUR /*- TUT
1 SOUR 273
1 SOUR AHamlin777.  E-Mail “Descendents of some guy
1 SOUR Blundell, Sherrie Lea (Slingerland).  information provided on 16 Apr 1995
1 SOUR Blundell, William, Rev. Interview on Jan 29, 1995.
1 SOUR Bogert, Theodore. AOL user “TedLBJ” File uploaded to American Online
1 SOUR Buell, Barbara Jo (Slingerland)
1 SOUR Buell, Beverly Anne (Wenge)
1 SOUR Buell, Beverly Anne (Wenge).  letter addressed to Kim & Barb Buell dated
1 SOUR Buell, Kimberly James.
1 SOUR Buell, Matthew James. written December 19, 1994.
1 SOUR Burnham, Crystal (Harris).  Leter sent to Matt J. Buell on Mar 18, 1995.
1 SOUR Burnham, Crystal Colleen (Harris).  AOL user CBURN1127.  E-mail “Re: [...etc.]

Some of these sources could be tied to cleaner IDs (eg. for books c/o Open Library, although see ‘in search of cultural identifiers‘ from Michael Smethurst).

I believe RDF’s SPARQL language gives us a useful tool (the notion of ‘GRAPH’) that can be applied here, but we’re a long way from having worked out the details when it comes to attaching evidence to claims. So for now, we in the RDF scene have a fairly course-grained approach to data provenance. Databases are organized into batches of triples, ie. RDF statements that claim something about the world. And while we can use these batches – aka graphs – in our queries, we haven’t really figured out what kind of information we want to associate with them yet. Which is a pity, since this could have uses well beyond family history, for example to online journalistic practices and blog-mediated fact checking.

Nearby in the Web: see also the SIOC/SWAN telecons, a collaboration in the W3C SemWeb lifescience community around the topic of modelling scientific discourse.

5 Responses to Family trees, Gedcom::FOAF in CPAN, and provenance

  1. john225 says:

    Thanks for this. I’d been meaning to write such a script myself after my folks sent me a copy of the family tree. I tried your script last night and it seemed to work just fine, but I’ll check it more tonight.

    Will also be interesting to hook the RDF up to a family ontology like

    http://www.owldl.com/ontologies/family.owl

    and I might well try this later. I suspect the ontology and/or RDF will need some adjustment to ensure they work together.

  2. eubanksb says:

    Great work, Dan! Do you know of any community effort creating a comprehensive genealogy ontology?

    Any work in this area should include multiple reification levels.
    In other words, “Danbri” says “census122454345″ documents birthdate of “person-uri-3287638″ as “12 June 1743″.

  3. I’ve also been working on genealogy micro formats by using vCard, vCalendar, and XFN. You can see a description and example of genealogy hosting at http://gedview.com/microxfn/

  4. I need to try this with my GEDCOM and use it in a project that I have. It is a Genealogy website that is based on MediaWiki and the Semantic MediaWiki bundle. There is a RDFIO extension that will import RDF data. This is a good start for including a great deal of data into the wiki. The question is whether or not the large dump file will try to create one page or individual pages for the different Individuals.
    My site is at: http://my-family-lineage.com/w/
    It was featured as November 2011 Semantic MediaWiki of the month. It still has a long way to go though. There is a need to include source information, mapping data, and much more. I also have a problem with image uploads.
    The forms and templates should make it user friendly though.
    Bruce

  5. I’d also like to know about a comprehensive genealogy ontology. I am using the vocabularies, FOAF, REL – dealing with Relationships and BIO: Biographic information. Brian Cassady’s module does the same.
    Bruce

Leave a Reply