Disambiguating with DBpedia

Sketchy notes. Say you’re looking for an identifier for something, and you know it’s a company/organization, and you have a label “Woolworths”.

What can be done to choose amongst the results we find in DBpedia for this crude query?

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select distinct ?x where {
?x a <http://dbpedia.org/ontology/Organisation>;  rdfs:label ?l .
FILTER(REGEX(?l, "Woolworths*")).
}

More generally, are the tweaks and tricks needed to optimise this sort of disambiguation going to be cross-domain, or do we have to hand-craft them, case by case?
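One tweak that is probably cross-domain is to rank whatever labels the crude query returns by string similarity to the label we started with. A minimal sketch, assuming the results have already been fetched as (URI, label) pairs; the `rank_candidates` helper and the example.org candidates are made up for illustration and are not real DBpedia results.

```python
from difflib import SequenceMatcher


def rank_candidates(query, candidates):
    """Sort (uri, label) pairs by case-insensitive similarity to the query label.

    Returns a list of (score, uri, label) tuples, best match first.
    """
    def score(label):
        return SequenceMatcher(None, query.lower(), label.lower()).ratio()

    return sorted(
        ((score(label), uri, label) for uri, label in candidates),
        reverse=True,
    )


# Hypothetical candidates, for illustration only.
candidates = [
    ("http://example.org/resource/A", "Woolworths Group"),
    ("http://example.org/resource/B", "Woolworths"),
    ("http://example.org/resource/C", "F. W. Woolworth Company"),
]

for s, uri, label in rank_candidates("Woolworths", candidates):
    print(f"{s:.2f}  {label}  <{uri}>")
```

String similarity alone will not separate two organisations that share an identical label, so this only narrows the list; it does not settle the question of hand-crafting versus generic rules.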

Published by danbri


Join the Conversation


  1. SPARQL doesn’t seem like the best solution since even tracking down the correct predicate URI is non-trivial across domains.

    I would rather use an HTML search interface to find a Web document representation of the thing and then deduce the “real world” URI from there. Ideally, this URI would be included in the HTML Web document representation and branded somehow for easy identification similar to RDF icons (http://www.w3.org/RDF/icons/) or microformats (http://microformats.org/wiki/icons). Presumably, this URI will be a hash or 303 redirect back to the page being viewed, so it might be nice to have some JavaScript functionality built around it for various purposes like copy-paste or variant format selection.

  2. Yes, for one-offs, nothing beats a human. I should have mentioned another constraint – I’m dealing with 1000s of these, so there is some desire for automation. A halfway house is to use Amazon Mechanical Turk somewhere in the workflow.

  3. Without using full context and linked data extraction techniques to make informed opinions, there isn’t too much you can do. However, I did knock up this little query earlier in the year, which gets a good match for what you’re looking for in the #1 slot most of the time. It certainly works for me:

    SELECT DISTINCT ?uri as ?s WHERE {{
    SELECT (bif:either( bif:isnull(?redir) , ?s , ?redir )) as ?uri ?sumscore ( bif:either( bif:isnull(?redir) , ?iriScore , ?ririScore )) as ?wiriScore WHERE {
    SELECT DISTINCT ?s SUM(?ascore) as ?sumscore ((?s)) as ?iriScore ((?s2)) as ?ririScore WHERE
    { ?s ?p ?o . FILTER( lang(?o) = "en" ) . ?o bif:contains 'woolworths' option (score ?ascore) }
    { ?s2 ?p ?o . FILTER( lang(?o) = "en" ) . ?o bif:contains 'woolworths' option (score ?ascore) . ?s ?s2 }
    GROUP BY ?s ?s2
    OPTIONAL { ?s ?redir } .
    OPTIONAL { ?s rdf:type ?type } . FILTER ( bif:either( bif:isnull(?type) , "" , ?type ) != ) .
    FILTER( bif:strcasestr( ?s , "woolworths" ) ) .
    GROUP BY ?uri
    ORDER BY desc( MAX(?wiriScore) )

    resulting in:

  4. Dan,
    That is a very good question and I think that there is no definitive answer. The best choice of method depends on the application and on the information that you have available besides the (part of) label “Woolworths”. Do you have a paragraph where “Woolworths” is mentioned and you can use the words to provide context (Annotation/Extraction)? Or are you looking at a database and you have other attributes of that entity (Record Linkage)? Or you have nothing else and you want the most likely entity?

    In the next few weeks we will release software and data to help some of these use cases. Keep an eye out for announcements from the DBpedia team. :)

  5. I’ve been thinking I’d like a tool that’d let you give it some descriptions of resources manually linked with owl:sameAs and owl:differentFrom, and it would work out what was sufficiently similar about the sameAs pairs, so that you could automate the linking.
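The idea in the last comment can be sketched roughly as follows. This is only an illustration, assuming each resource description has been flattened to a dict of property to value, and each manually linked pair is tagged as same (owl:sameAs) or different (owl:differentFrom). The `property_agreement` helper and the sample descriptions are hypothetical.

```python
from collections import defaultdict


def property_agreement(pairs):
    """Measure how often each shared property agrees within labelled pairs.

    pairs: list of (desc_a, desc_b, is_same), where each desc is a dict
    mapping property name to value, and is_same is True for owl:sameAs
    links and False for owl:differentFrom links.
    Returns {property: (agreement rate among same pairs,
                        agreement rate among different pairs)},
    with None where a property was never compared in that group.
    """
    # For each property and link type: [agreements, comparisons]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for a, b, is_same in pairs:
        for prop in a.keys() & b.keys():
            tally = counts[prop][is_same]
            tally[0] += a[prop] == b[prop]
            tally[1] += 1

    def rate(tally):
        return tally[0] / tally[1] if tally[1] else None

    return {p: (rate(c[True]), rate(c[False])) for p, c in counts.items()}


# Made-up descriptions, for illustration only.
pairs = [
    ({"country": "AU", "founded": "1924"}, {"country": "AU", "founded": "1924"}, True),
    ({"country": "AU", "founded": "1924"}, {"country": "AU", "founded": "1879"}, False),
]

for prop, (same_rate, diff_rate) in sorted(property_agreement(pairs).items()):
    print(prop, same_rate, diff_rate)
```

A property that nearly always agrees on the same pairs and rarely agrees on the different pairs (here, "founded") is a candidate linking key; one that agrees in both groups (here, "country") tells us little.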
