Everything Still Looks Like A Graph (but graphs look like maps)

Last October I posted a writeup of some experiments that illustrate item-to-item similarities from Apache Mahout using Gephi for visualization. This was under a heading that quotes Ben Fry, “Everything looks like a graph” (but almost nothing should ever be drawn as one). There was also some followup discussion on the Gephi project blog.

I’ve just seen a cluster of related Gephi experiments, which are reinforcing some of my prejudices from last year’s investigations:

These are all well worth a read, both for showing the potential and the limitations of Gephi. It’s not hard to find critiques of the intelligibility or utility of squiggly-but-inspiring network diagrams; Ben Fry’s point was well made. However I think each of the examples I link here (and my earlier experiments) show there is some potential in such layouts for showing ‘similarity neighbourhoods’ in a fairly appealing and intuitive form.

In the case of the history of Philosophy it feels a little odd using a network diagram since the chronological / timeline aspect is quite important to the notion of a history. But still it manages to group ‘like with like’, to the extent that the inter-node connections probably needn’t even be shown.

I’m a lot more comfortable with taking the ‘everything looks like a graph’ route if we’re essentially generating a similarity landscape. Whether these ‘landscapes’ can be made to be stable in the face of dataset changes or re-generation of the visualization is a longer story. Gephi is currently a desktop tool, and as such has memory issues with really large graphs, but I think it shows the potential for landscape-oriented graph visualization. Longer term I expect we’ll see more of a split between something like Hadoop+Mahout for big data crunching (e.g. see Mahout’s spectral clustering component, which takes node-to-node affinities as input) and a WebGL-based, in-browser UI for the front-end. It’s a shame the Gephi efforts in this direction (GraphGL) seem to have gone quiet, but for those of you with modern graphics cards and browsers, take a look at alterqualia’s ‘dynamic terrain’ WebGL demo to get a feel for how landscape-shaped datasets could be presented…
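The input to these landscape layouts is just a weighted edge list of item-to-item similarities. Here’s a minimal pure-Python sketch of the idea — toy data and a simple cosine-over-user-sets measure, standing in for Mahout’s pluggable similarity computations, not reproducing them:

```python
import math
from itertools import combinations

# Toy co-occurrence data: which users interacted with which items.
# Names are illustrative, not drawn from any real dataset.
item_users = {
    "item_a": {"u1", "u2", "u3"},
    "item_b": {"u2", "u3", "u4"},
    "item_c": {"u9"},
}

def cosine(a, b):
    """Cosine similarity between two user sets (binary vectors)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# The weighted edge list a tool like Gephi would lay out as a 'landscape'.
edges = [
    (i, j, cosine(item_users[i], item_users[j]))
    for i, j in combinations(sorted(item_users), 2)
]
for i, j, w in edges:
    print(i, j, round(w, 3))
```

Items sharing many users get heavy edges and end up in the same ‘similarity neighbourhood’; unconnected items drift apart.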

Also btw look at the griffsgraphs landscape of literature; this was built solely from ‘influences’ relationships from Wikipedia… then compare it with the landscapes I was generating last year from Harvard bibliographic data, which were built solely using subject classification data. Now imagine if we could mutate the resulting ‘map’ by choosing our own weighting composited across these two sources. Perhaps for the music, movies or TV areas of the map we might composite in other sources, based on activity data analysed by a recommendation engine, or just different factual relationships.

There’s no single ‘correct’ view of the bibliographic landscape; what makes sense for a PhD researcher, a job seeker or a schoolkid will naturally vary. This is true of similarity measures in general, i.e. for see-also lists in plain HTML as well as fancy graph- or landscape-based visualizations. There are more than metaphorical comparisons to be drawn with the kind of compositing tools we see in systems like Blender, and plenty of opportunities for putting control into end-user rather than engineering hands.
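The compositing idea is mechanically simple: blend per-source similarity scores under user-chosen weights. A sketch, with all scores, pairs and source names invented for illustration:

```python
# Two hypothetical similarity sources for the same pairs of works:
# one derived from subject classifications, one from 'influences' links.
subject_sim = {("kafka", "camus"): 0.8, ("kafka", "dickens"): 0.3}
influence_sim = {("kafka", "camus"): 0.5, ("kafka", "dickens"): 0.0}

def composite(pair, weights):
    """Blend similarity scores; the user, not the engineer, picks the weights."""
    sources = [(subject_sim, weights["subject"]),
               (influence_sim, weights["influence"])]
    return sum(w * src.get(pair, 0.0) for src, w in sources)

# A researcher might trust subject data; a fan might favour influence links.
print(composite(("kafka", "camus"), {"subject": 0.7, "influence": 0.3}))
```

Re-running the layout after changing the weights would reshape the map — which is exactly the stability question raised above.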

In just the last year, Harvard (and most recently OCLC) have released their bibliographic dataset for public re-use, the Wikidata project has launched, and browser support for WebGL has been improving with every release. Despite all the reasonable concerns out there about visualizing graphs as graphs, there’s a lot to be said for treating graphs as maps…

Problem statement

A Pew Research Center survey released a few days ago found that only half of Americans correctly know that Mr. Obama is a Christian. Meanwhile, 13 percent of registered voters say that he is a Muslim, compared with 12 percent in June and 10 percent in March.

More ominously, a rising share — now 16 percent — say they aren’t sure about his religion because they’ve heard “different things” about it.

When I’ve traveled around the country, particularly to my childhood home in rural Oregon, I’ve been struck by the number of people who ask something like: That Obama — is he really a Christian? Isn’t he a Muslim or something? Didn’t he take his oath of office on the Koran?

It was in the NYTimes, so it must be true. Will the last one to leave the Web please turn off the lights.

Apparently the UK government are revisiting the idea of net censorship, in the context of anti-terrorism.

UK Home Secretary Jacqui Smith, as reported in the Guardian (“Government targets extremist websites”):

Speaking to the BBC’s Radio 4 Today programme before her speech, Smith said there were specific examples of websites that “clearly fall under the category of glorifying terrorism”. “There is growing evidence people may be using the internet both to spread messages and to plan specifically for terrorism,” she said. “That is why, as well as changing the law to make sure we can tackle that, there is more we need to do to show the internet is not a no-go area as far as tackling terrorism is concerned.”

This could go really wrong, really fast. Will we be allowed to read Bin Laden texts online? Hitler, Stalin? Talk to people who sympathise with organizations deemed terroristic? Who live in countries in the ‘axis of evil’? Doubtless the first sites to be targeted will be the most outrageous, but we’re on a slippery slope here.

It’s pretty much impossible to stop the online radicalisation of angry young men. But driving that process underground, and criminalising anyone on the fringes of the scene, will make it all the harder for calm voices and nuanced opinions to be heard. ‘Us and them’ is exactly what we don’t need right now.

“The World is now closed”

Facebook in many ways is pretty open for a ‘social networking’ site. It gives extension apps a good amount of access to both data and UI. But the closed world language employed in their UI betrays the immodest assumption “Facebook knows all”.

  • Eric Childress and Stuart Weibel are now friends with Charles McCathienevile.
  • John Doe is now in a relationship.
  • You have 210 friends.

To state the obvious: maybe Eric, Stu and Chaals were already friends. Maybe Facebook was the last to know about John’s relationship; maybe friendship isn’t countable. As the walls between social networking sites slowly melt (I put Jabber/XMPP first here, with OpenID, FOAF, SPARQL and XFN as helper apps), me and my 210 closest friends will share fragments of our lives with a wide variety of sites. If we choose to make those descriptions linkable, the linked sites will increasingly need to refine their UI text to be a little more modest: even the biggest site doesn’t get the full story.

Closed World Assumption (Abort/Retry/Fail)
Facebook are far from alone in this (see this Xbox screenshot too, “You do not have any friends!”); but even with 35M users, the mistake is jarring, and not just to Semantic Web geeks of the “missing isn’t broken” school. It’s simply a mistake to fail to distinguish the world from its description, or the territory from the map.

A description of me and my friends hosted by a big Web site isn’t “my social network”. Those sites just host databases containing claims made by different people, some verified, some not. And with, inevitably, lots missing. My “social network” is an abstractification of a set of interlinked real-world histories. You could make the case that there has only ever been one “social network” since the distant beginnings of human society; certainly those who try to do genealogy with Web data formats run into this in a weaker form, including the need to balance competing and partial information. We can do better than categorised “buddylists” when describing people, their inter-connections and relationships. And in many ways Facebook is doing just great here. Aside from the Pirates-vs-Ninjas noise, many extension applications on Facebook allow arbitrary events from elsewhere in the Web to bubble up through their service and be seen (or filtered) by others who are linked to me in their database.

Facebook is good at reporting events, generally. Especially those sourced outside the system. Where it isn’t so great is when reporting internal events, eg. someone telling it about a relationship. Event descriptions are nice things to syndicate btw since they never go out of date. Syndicating descriptions of the changeable properties of the world, on the other hand, is more slippery, since you need to have all other relevant facts to be able to say how the world is right now (or implicitly, how it used to be, before). “Dan has painted his car red” versus “Dan’s car is now red”. “Dan has bookmarked the Jabber user profile spec” versus “Dan now has 1621 bookmarks”. “Dan has added Charles to his Facebook profile” versus “Dan is now friends with Charles”.
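The event-versus-state distinction can be made concrete: an event log is a set of durable claims, while “how things are now” has to be derived by replaying it. A toy sketch, with invented events:

```python
# Events are durable claims; current state must be derived from the full log.
events = [
    ("2007-09-01", "dan", "painted_car", "red"),
    ("2007-09-05", "dan", "painted_car", "blue"),
]

def current_state(log):
    """Replay events in date order; the last relevant event wins."""
    state = {}
    for _date, who, action, value in sorted(log):
        if action == "painted_car":
            state[(who, "car_colour")] = value
    return state

# "Dan has painted his car red" never goes out of date as a statement;
# "Dan's car is now red" stopped being true when the second event arrived.
print(current_state(events)[("dan", "car_colour")])  # → blue
```

A syndicator holding only one of these events can still repeat it truthfully; a syndicator asserting current state needs the whole log.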

We need better UI that reflects what’s really going on. There will be users who choose to live much of their lives in public view, spread across sites, sharing enough information for these accounts to be linked. Hopefully they’ll be as privacy-smart and selective as Pew suggests. Personas and ‘characters’ can be spread across sites without either site necessarily revealing a real-world identity; secrets are keepable, at least in theory. But we will see people’s behaviour and claims from one site leak into another, and with approval. I don’t think this will be just through some giant “social graph” of strictly enumerated relationships, but through a haze of vaguer data.

What we’re most missing is a style of end-user UI here that educates users about this world that spans websites, couching things in terms of claims hosted in sites, rather than in absolutist terms. I suppose I probably don’t have 210 “friends” (whatever that means) in real life, although I know a lot of great people and am happy to be linked to them online. But I have 210 entries in a Facebook-hosted database. My email whitelist file has 8785 email addresses in it currently; email accounts that I’m prepared to assume aren’t sending me spam. I’m sure I can’t have 8785 friends. My Google Mail (and hence GTalk Jabber) account claims 682 contacts, and has some mysterious relationship to my Orkut account where I have 200+ (more randomly selected) friends. And now the OpenID roster on my blog gives another list (as of today, 19 OpenIDs that made it past the WordPress spam filter). Modern social websites shouldn’t try to tell me how many friends I have; that’s just silly. And they shouldn’t assume their database knows it all. What they can do is try to tell me things that are interesting to me, with some emphasis on things that touch my immediate world and the extended world of those I’m variously connected to.

So what am I getting at here? I guess it’s just that we need these big social sites to move away from making teen-talk claims about how the world is – “Sally (now) loves John” – and instead become reflectors for the things people are saying, “Sally announces that she’s in love with John”; “John says that he used to work for Microsoft” versus “John worked for Microsoft 2004-2006”; “Stanford University says Sally was awarded a PhD in 2008”. Today’s young internet users are growing up fast, and the Web around them needs also to mature.

One of the most puzzling criticisms you’ll often hear about the Semantic Web initiative is that it requires a single universal truth, a monolithic ontology to model all of human knowledge. Those of us in the SW community know that this isn’t so; we’ve been saying for a long time that our (meta)data architecture is designed to allow people to publish claims “in which statements can draw upon multiple vocabularies that are managed in a decentralised fashion by various communities of expertise.”
As the SemWeb technology stack now has a much better approach to representing data provenance (SPARQL named graphs replacing RDF’99 statement reification) I believe we should now be putting more emphasis on a related theme: Semantic Web data can represent disputes, competing claims, and contradictions. And we can query it in an SQL-like language (SPARQL) that allows us to ask questions not just of some all-knowing database, but about what different databases are telling us.

The closed world approach to data gives us a lot, don’t get me wrong. I’m not the only one with a love-hate relationship with SQL. There are many optimisations we can do in a traditional SQL or XML Schema environment which become hard in an RDF context. In particular, going “open world” makes for a harder job when hosting and managing data rather than merely aggregating and integrating it. Nevertheless, if you’re looking for a modern Web data environment for aggregating claims of the “Stanford University says Sally was awarded a PhD in 1995” form, SPARQL has a lot to offer.

When we’re querying a single, all-knowing, all-trusted database, SQL will do the job (eg. see Facebook’s FQL for example). When we need to take a bit more care with “who said what” and “according to whom?” aspects, coupled with schema extensibility and frequently missing data, SQL starts to hurt. If we’re aggregating (and building UI for) ‘social web’ claims about the world rather than simple buddylists (which XMPP/Jabber gives us out of the box), I suspect aggregators will get burned unless they take care to keep careful track of who said what, whether using SPARQL or some home-grown database system in the same spirit. And I think they’ll find that doing so will be peculiarly rewarding, giving us a foundation for applications that do substantially more than merely listing your buddies…
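A “home-grown database system in the same spirit” only needs to keep the source attached to every claim, the way SPARQL named graphs do. A toy sketch (sources, subjects and claims all invented):

```python
# Quads: (source, subject, predicate, object) -- the source is first-class,
# in the spirit of SPARQL named graphs, not a real triple store.
quads = [
    ("stanford.edu", "sally", "phd_awarded", "1995"),
    ("sallys_blog", "sally", "phd_awarded", "2008"),
    ("john", "john", "employer", "Microsoft"),
]

def who_says(subject, predicate):
    """Return competing claims keyed by source, instead of one 'truth'."""
    return {src: obj for src, s, p, obj in quads
            if s == subject and p == predicate}

# Two sources disagree; a closed-world database would have to pick one.
print(who_says("sally", "phd_awarded"))
```

The equivalent SPARQL question would use `GRAPH ?source { … }` to bind the graph a claim came from, which is exactly what the source column does here.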

IM/RSS bot – BBC Persian News Flash

OK this is old news, but pretty cool so I’m happy to write it up belatedly.

I just logged into MSN chat, and was greeted by Mario Menti’s IM bot, which provides a text-chat UI for navigating the BBC’s news feeds from their Persian service. I’m pasting the output here, hoping it’ll display reasonably. I can’t read a word of it of course, but remember Ian Forrester’s XTech talk a few years back about the headaches for getting I18N right for such feeds (and the varying performance of newsreader clients with right-to-left and mixed direction text). This hack came out of a conversation with Mario and Ian around the BBC Backstage scene, and from comments from a couple of friends in Tehran, this sort of technology direction is much appreciated by those whose news access is restricted. The bot is called bbcpersian at hotmail.co.uk, and seems to still be running 18 months later. See also some more recent hacks from Mario that wire up BBC feeds to twitter.

BBC Persian News Flash says: (23:01:02)

Hi, this is your hourly BBCPersian.com news flash with the 10 most recent new items
1 افزایش نیروها در عراق ‘درحال نتیجه دادن است’
2 انتقاد شدید کروبی از ‘مخالفان احزاب’
3 نواز شریف از پاکستان اخراج شد
4 بازداشت یکی از ‘قاچاقچیان بزرگ’ کلمبیا
5 ترکیه: کشورهای منطقه از اقدامات تنش زا دوری کنند
6 ‘عاشقان قلندر’ جشنواره ای دیگر برپا کردند
7 کاهش ساعت کار ادارات دولتی ایران در ماه رمضان
8 ‘عراقیها احساس امنیت بیشتری نمی کنند’
9 نواز شریف از پاکستان اخراج شد
10 شرکت مردم گواتمالا در انتخابات این کشور

Reply with number 1 to 10 to see more information, or any other message if you want to stop receiving these news flashes
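The bot’s menu-and-reply loop is simple enough to sketch. This is my guess at the shape of it, not Mario’s actual code, and the feed items are placeholders rather than real BBC headlines:

```python
def menu(items, limit=10):
    """Render the latest feed items as the numbered list the bot sends."""
    return "\n".join(f"{n} {title}"
                     for n, (title, _url) in enumerate(items[:limit], start=1))

def handle_reply(items, reply):
    """A digit 1-10 selects a story; anything else stops the news flashes."""
    if reply.strip().isdigit() and 1 <= int(reply) <= min(10, len(items)):
        title, url = items[int(reply) - 1]
        return f"{title}\n{url}"
    return "unsubscribed"

items = [("Headline one", "http://example.org/1"),
         ("Headline two", "http://example.org/2")]
print(menu(items))
print(handle_reply(items, "2"))
```

The nice property of this design for restricted-access audiences is that everything rides over ordinary IM traffic; the feed URLs never reach the user’s network.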

Anyone know what the state of the art is with IM-based feed readers? or have a wishlist?

British Board of Film Classification RSS feeds and Movie metadata

The BBFC have several RSS feeds on their site, carrying information about their judgements on various cinematic works for a UK audience. Recent film decisions, recent adult (sex) videos and films, etc. Each entry in the feed points to a descriptive page and summarises a BBFC judgement in a simple textual description, eg. “The BBFC gave the English language video LES PERVERSIONS 5 a rating of R18 on Thu, 10 Feb. Consumer advice is not supplied for R18 titles. The video is directed by Sineplex.”

While their adult feed is interesting in the context of the debates around Web filtering etc., the mainstream feed is also interesting. It has textual information about sex, violence, drugs etc., which could easily be exposed in machine-processable form if they’d used RSS 1.0 + ICRA/RDF labels. Both feeds make the semantic web point about data re-use, since such labels can be used for finding things as much as for not finding things.

The BBFC gave the English language film TABLOID a rating of 18 on Fri, 28 Jan. This film contains STRONG SEX, VIOLENCE, LANGUAGE AND DRUG USE. The film is directed by David Blair. The cast includes Matthew Rhys, Mary Elizabeth Mastrantonio, David Soul, John Hurt, Stephen Tompkinson, Art Malik, Dani Behr, Keith Chegwin, Ainsley Harriott, Gail Porter, Beverley Callard, Les Dennis, Danny Dyer, James Hewitt, Freddie Jones, Vicky Holloway, Vikki Thomas and Anna Kumble.
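Even without ICRA/RDF labels, the judgement text is regular enough that a scraper could recover structured fields. A sketch against the example above — the pattern is guessed from these two sample sentences, not from any published BBFC format:

```python
import re

# Hypothetical pattern, inferred from two sample judgement sentences.
JUDGEMENT = re.compile(
    r"The BBFC gave the (?P<language>.+?) (?P<medium>film|video) "
    r"(?P<title>.+?) a rating of (?P<rating>\S+) on (?P<date>[^.]+)\."
)

text = ("The BBFC gave the English language film TABLOID a rating of 18 "
        "on Fri, 28 Jan. This film contains STRONG SEX, VIOLENCE, "
        "LANGUAGE AND DRUG USE.")

m = JUDGEMENT.search(text)
print(m.group("title"), m.group("rating"))  # structured fields, not prose
```

Which is the data-reuse point again: the same extracted rating serves a filter blocking 18-rated films and a search for them equally well.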

I’ve been thinking about how FOAF could better support recommendation systems, eg. around MusicBrainz for music, or systems like MindSwap’s FilmTrust for movies. For movies, one core issue is quite simple: providing unique identifiers for films (direct or indirect, eg. via a page that has some film as its primary topic). BBFC or IMDB pages, or movie homepages, could serve such a purpose. Unfortunately, the world of movies doesn’t yet have a good open-content licensed database, unlike music, where we have MusicBrainz. Until we agree on some tricks for identifying things like movies (and actors, …), we won’t get the data integration needed to have a really rich Web-wide movie review system.
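The indirect-identification trick boils down to an alias table: each identifying page maps to one shared film key, so reviews keyed against different URLs can be merged. A sketch with entirely made-up URLs, reviewers and ratings:

```python
# Hypothetical pages that each identify the same film indirectly
# (a page whose 'primary topic' is the film), mapped to one shared key.
same_film = {
    "http://www.imdb.com/title/tt0000001/": "film:tabloid",
    "http://www.bbfc.co.uk/film/tabloid": "film:tabloid",
}

reviews = [
    ("http://www.imdb.com/title/tt0000001/", "alice", 4),
    ("http://www.bbfc.co.uk/film/tabloid", "bob", 2),
]

def merged_ratings(reviews, aliases):
    """Group reviews by canonical film, however each reviewer named it."""
    by_film = {}
    for page, reviewer, stars in reviews:
        by_film.setdefault(aliases[page], []).append((reviewer, stars))
    return by_film

print(merged_ratings(reviews, same_film))  # both reviews land on one film
```

Maintaining that alias table across the whole Web is, of course, the hard open problem; MusicBrainz solves it for music, and movies still lack the equivalent.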

We will eventually, I am sure, see a framework in which various sites aggregate and syndicate such opinions, either numerical ratings or (more likely I think) textual reviews. Often I’m quite interested to see how a movie was perceived by people I disagree with, or have never met. The CapAlert site is often entertaining, for example. All these sources (as well as smaller community datasets) will be mixed together in a metadata marketplace. Information that some people use for filtering, blocking and avoiding will be used by others for searching, browsing and discovery. It’s just a matter of time before we’ll be using W3C’s new SPARQL technology to query BBFC judgement feeds, FOAF+review data from sites like FilmTrust and other weblog-based data sources… Anyhow, definitely check out the FilmTrust site if you’re interested in movie metadata and ratings.

“Beyond blocking — U.S. and open source censorship slims the Net” – Newsforge (May 2004)

“Beyond blocking — U.S. and open source censorship slims the Net”, fairly interesting piece, but mostly entertaining because of the Google Ads for filtering technology that show up (at least now) while reading it. “Easily block Internet content with this new-generation content filter”, “Netmop projects families from pron. Parents too! Works with any ISP”, “Filter updates every 2 hours Only the Good Stuff – Chaperon”, …