Linked Literature, Linked TV – Everything Looks like a Graph

[Image: cloud]

Ben Fry in ‘Visualizing Data’:

Graphs can be a powerful way to represent relationships between data, but they are also a very abstract concept, which means that they run the danger of meaning something only to the creator of the graph. Often, simply showing the structure of the data says very little about what it actually means, even though it’s a perfectly accurate means of representing the data. Everything looks like a graph, but almost nothing should ever be drawn as one.

There is a tendency when using graphs to become smitten with one’s own data. Even though a graph of a few hundred nodes quickly becomes unreadable, it is often satisfying for the creator because the resulting figure is elegant and complex and may be subjectively beautiful, and the notion that the creator’s data is “complex” fits just fine with the creator’s own interpretation of it. Graphs have a tendency of making a data set look sophisticated and important, without having solved the problem of enlightening the viewer.

[Image: markets]

Ben Fry is entirely correct.

I suggest two excuses for this indulgence: if the visuals are meaningful only to the creator of the graph, then let’s make everyone a graph curator. And if the things the data attempts to describe — for example, 14 million books and the world they in turn describe — are complex and beautiful and under-appreciated in their complexity and interconnectedness, … then perhaps it is ok to indulge ourselves. When do graphs become maps?

I report here on some experiments that stem from two collaborations around Linked Data. All the visuals in this post are views of bibliographic data, based on similarity measures derived from book/subject keyword associations, with visualization and a little additional analysis using Gephi. Click through to Flickr to see larger versions of any image. You can’t always see the inter-node links, but the presentation is based on graph layout tools.

Firstly, in my ongoing work in the NoTube project, we have been working with TV-related data, ranging from ‘social Web’ activity streams and user profiles to TV archive catalogues and classification systems like Lonclass. Secondly, over the summer I have been working with the Library Innovation Lab at Harvard, looking at ways of opening up bibliographic catalogues to the Web as Linked Data, and at ways of cross-linking Web materials (e.g. video materials) to a Webbified notion of ‘bookshelf’.

In NoTube we have been making use of the Apache Mahout toolkit, which provided us with software for collaborative filtering recommendations, clustering and automatic classification. We’ve barely scratched the surface of what it can do, but here I show some initial results of applying Mahout to a 100,000 record subset of Harvard’s 14 million entry catalogue. Mahout is built to scale, and the experiments here use datasets that are tiny from Mahout’s perspective.

[Image: gothic_idol]

In NoTube, we used Mahout to compute similarity measures between each pair of items in a catalogue of BBC TV programmes for which we had privileged access to subjective viewer ratings. This was a sparse matrix of around 20,000 viewers, 12,500 broadcast items, with around 1.2 million ratings linking viewer to item. From these, after a few rather-too-casual tests using Mahout’s evaluation measure system, we picked its most promising similarity measure for our data (LogLikelihoodSimilarity or Tanimoto), and then for the most similar items, simply dumped out a huge data file that contained pairs of item numbers, plus a weight.
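
For the curious, the core of that step needs only a few lines against Mahout’s ‘Taste’ recommender API. The sketch below is not our NoTube code, just a minimal reconstruction of the approach, assuming a hypothetical ratings.csv of “viewerID,itemID,rating” lines and a couple of made-up item IDs:

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class ItemSimilarityDump {
      public static void main(String[] args) throws Exception {
        // ratings.csv: one "viewerID,itemID,rating" triple per line (hypothetical filename)
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // Log-likelihood works on co-occurrence rather than rating values, which suited
        // our sparse viewer data; TanimotoCoefficientSimilarity is a drop-in alternative.
        ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
        GenericItemBasedRecommender recommender =
            new GenericItemBasedRecommender(model, similarity);
        // For each item of interest, dump pairs of item numbers plus a weight.
        for (long itemID : new long[] {62L, 82127L}) {
          for (RecommendedItem s : recommender.mostSimilarItems(itemID, 10)) {
            System.out.println(itemID + "\t" + s.getItemID() + "\t" + s.getValue());
          }
        }
      }
    }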

There are many, many smarter things we could’ve tried, but in the spirit of ‘minimal viable product’, we didn’t try them yet. These include making use of additional metadata published by the BBC in RDF, so we could help out Mahout by letting it know that when Alice loves item_62 and Bob loves item_82127, we also know via RDF that both items are in the same TV series and Brand. Why use fancy machine learning to rediscover things we already know, and that have been shared in the Web as data? We could make smarter use of metadata here. Secondly, we could have used data-derived or publisher-supplied metadata to explore whether different Mahout techniques work better for different segments of the content (factual vs fiction) or even, since we also have some demographic data, for different groups of users.

[Image: markets]

Anyway, Mahout gave us item-to-item similarity measures for TV. Libby has written already about how we used these in ‘second screen’ (or ‘N-th’ screen, aka N-Screen) prototypes showing the impact that new Web standards might make on tired and outdated notions of “TV remote control”.

What if your remote control could personalise a view of some content collection? What if it could show you similar things based on your viewing behavior, and that of others? What if you could explore the ever-growing space of TV content using simple drag-and-drop metaphors, sending items to your TV or to your friends with simple tablet-based interfaces?

[Image: medieval_society]

So that’s what we’ve been up to in NoTube. There are prototypes using BBC content (sadly not viewable by everyone due to rights restrictions), but also some experiments with TV materials from the Internet Archive, and some explorations that look at TED’s video collection as an example of Web-based content that (via ted.com and YouTube) is more generally viewable. Since every item in the BBC’s Archive is catalogued using a library-based classification system (Lonclass, itself based on UDC), the topic of cross-referencing books and TV has cropped up a few times.

[Image: new_colonialism]

Meanwhile, in (the digital Public Library of) America, … the Harvard Library Innovation Lab team have a huge and fantastic dataset describing 14 million bibliographic records. I’m not sure exactly how many are ‘books’; libraries hold all kinds of objects these days. With the Harvard folk I’ve been trying to help figure out how we could cross-reference their records with other “Webby” sources, such as online video materials, again using TED as an example because it is high quality but has very different metadata from the library records. So we’ve been looking at various tricks and techniques that could help us associate book records with those videos. For example, we can find tags for the videos on the TED site, but also on Delicious and on YouTube. However, taggers and librarians tend to describe things quite differently. Tags like “todo”, “inspirational”, “design”, “development” or “science” don’t help us pinpoint the exact library shelf where a viewer might go to read more on the topic. Conversely, they don’t help the library sites understand where within their online catalogues they could embed useful and engaging “related link” pointers off to TED.com or YouTube.

So we turned to other sources. Matching TED speaker names against Wikipedia allows us to find more information about many TED speakers. For example the Tim Berners-Lee entry, which in its Linked Data form helpfully tells us that this TED speaker is in the categories ‘Japan_Prize_laureates’, ‘English_inventors’, ‘1955_births’ and ‘Internet_pioneers’. All good to know, but it’s hard to tell which categories tell us most about our speaker or video. At least, now that we’re in the Linked Data space, we can navigate around to Freebase, VIAF and a growing Web of data sources. It should be possible at least to associate TimBL’s TED talks with library records for his book (so we annotate one bibliographic entry, from 14 million! …can’t we map areas, not items?).
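
As a rough illustration (not the code we actually ran), here is how one might pull those category associations out of DBpedia’s SPARQL endpoint using the Jena library that crops up later in this blog; the endpoint URL and the dcterms:subject property are DBpedia’s own, everything else is made up for the example:

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.QuerySolution;
    import com.hp.hpl.jena.query.ResultSet;

    public class SpeakerCategories {
      public static void main(String[] args) {
        // Ask DBpedia which Wikipedia categories a TED speaker's entry belongs to.
        String query =
            "PREFIX dcterms: <http://purl.org/dc/terms/> " +
            "SELECT ?cat WHERE { " +
            "  <http://dbpedia.org/resource/Tim_Berners-Lee> dcterms:subject ?cat . }";
        QueryExecution qe = QueryExecutionFactory.sparqlService(
            "http://dbpedia.org/sparql", query);
        try {
          ResultSet results = qe.execSelect();
          while (results.hasNext()) {
            QuerySolution row = results.nextSolution();
            // e.g. http://dbpedia.org/resource/Category:Internet_pioneers
            System.out.println(row.getResource("cat").getURI());
          }
        } finally {
          qe.close();
        }
      }
    }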

[Image: tv]

Can we do better? What if we also associated Tim’s two TED talk videos with other things in the library that have the same subject classifications or keywords as his book? What if we could build links between the two collections based not only on published authorship, but on topical information (tags, full-text analysis of TED talk transcripts)? Can we plan for a world where libraries have access not only to MARC records, but also to the full text of millions of books?


I’ve been exploring some of these ideas with David Weinberger, Paul Deschner and Matt Phillips at Harvard, and in NoTube with Libby Miller, Vicky Buser and others.

[Image: edu]

Yesterday I took the time to do a visual sanity check of the bibliographic data as processed into a ‘similarity space’ in some Mahout experiments. This is a messy first pass at everything, but I figured it is better to blog something and look for collaborations and feedback than to chase perfection. For me, the big story is in linking TV materials to the gigantic back-story of context, discussion and debate curated by the world’s libraries. If we can imagine a view of our TV content catalogues, and our libraries, as visual maps, with items clustered by similarity, then NoTube has shown that we can build these into the smartphones and tablets that are increasingly being used as TV remote controls.


And if the device you’re using to pause/play/stop or rewind your TV also has access to these vast archives as they open up as Linked Data (as well as GPS location data and your Facebook password), all kinds of possibilities arise for linked, annotated and fact-checked TV, as well as for showing a path for libraries to continue to serve as maps of the entertainment, intellectual and scientific terrain around us.


A brief technical description. Everything you see here was made with Gephi, Mahout and experimental data from the Library Innovation Lab at Harvard, plus a few scripts to glue it all together.

Mahout was given 100,000 extracts from the Harvard collection. Just main and sub-title, a local ID, and a list of topical phrases (mostly drawn from Library of Congress Subject Headings, with some local extensions). I don’t do anything clever with these or their sub-structure or their library-documented inter-relationships. They are treated as atomic codes, and flattened into long pseudo-words such as ‘occupational_diseases_prevention_control’ or ‘french_literature_16th_century_history_and_criticism’,
‘motion_pictures_political_aspects’, ‘songs_high_voice_with_lute’, ‘dance_music_czechoslovakia’, ‘communism_and_culture_soviet_union’. All of human life is there.
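
The flattening itself is trivial; something along these lines (a hypothetical reconstruction, not the actual script) is enough to turn a subject heading with its subdivisions into one of those pseudo-words:

    public class TopicFlattener {
      // Turn e.g. "French literature -- 16th century -- History and criticism"
      // into "french_literature_16th_century_history_and_criticism".
      public static String flatten(String heading) {
        return heading
            .toLowerCase()
            .replaceAll("[^a-z0-9]+", "_")   // collapse spaces, punctuation and subdivision dashes
            .replaceAll("^_+|_+$", "");      // trim stray leading/trailing underscores
      }

      public static void main(String[] args) {
        System.out.println(flatten("French literature -- 16th century -- History and criticism"));
        System.out.println(flatten("Motion pictures -- Political aspects"));
      }
    }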

David Weinberger has been calling this gigantic scope our problem of the ‘Taxonomy of Everything’, and the label fits. By mushing phrases into fake words, I get to re-use some Mahout tools and avoid writing code. The result is a matrix of 100,000 bibliographic entities by 27,684 unique topical codes. Initially I made the simple test of feeding this as input to Mahout’s K-Means clustering implementation. Manually inspecting the most popular topical codes for each cluster (both with k=12, putting all books into 12 clusters, and with k=1000 for more fine-grained groupings), I was impressed by the initial results.


I only have these in crude text-file form. See hv/_k1000.txt and hv/_twelve.txt (plus the dictionary, in the big file _harv_dict.txt).

For example, in the 1000-cluster version, we get: ‘medical_policy_united_states’, ‘health_care_reform_united_states’, ‘health_policy_united_states’, ‘medical_care_united_states’,
‘delivery_of_health_care_united_states’, ‘medical_economics_united_states’, ‘politics_united_states’, ‘health_services_accessibility_united_states’, ‘insurance_health_united_states’, ‘economics_medical_united_states’.

Or another cluster: ‘brain_physiology’, ‘biological_rhythms’, ‘oscillations’.

How about: ‘museums_collection_management’, ‘museums_history’, ‘archives’, ‘museums_acquisitions’, ‘collectors_and_collecting_history’?

Another cluster, conceptually nearby (though that proximity isn’t visible through this simple clustering approach): ‘art_thefts’, ‘theft_from_museums’, ‘archaeological_thefts’, ‘art_museums’, ‘cultural_property_protection_law_and_legislation’, …

OK, I am cherry-picking. There is some nonsense in there too, but surprisingly little. And probably some associations that might cause offense. But it shows that the tooling is capable (by looking at book/topic associations) of picking out similarities that are significant. Maybe all of this is also available in LCSH SKOS form already, but I doubt it. (A side-goal here is to publish these clusters for re-use elsewhere…)


So, what if we take this, and instead compute (a bit like we did in NoTube from ratings data) similarity measures between books?


I tried that, without using much of Mahout’s sophistication. I used its RowSimilarityJob facility to generate similarity measures for each book, then threw out most of the similarities, keeping only the top 5 (later the top 3) for each book. From this point, I moved things over into the Gephi toolkit (“photoshop for graphs”), as I wanted to see how things looked.
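
Gephi happily imports a plain CSV edge list (Source, Target, Weight), so the glue between Mahout’s similarity output and the Gephi views is not much more than the following kind of thing – a hypothetical sketch with made-up filenames, rather than the script I actually used:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TopSimilaritiesToGephi {
      public static void main(String[] args) throws Exception {
        int topN = 3;  // keep only the strongest few links per book
        // similarities.tsv: "bookA <tab> bookB <tab> weight" per line (made-up filename)
        Map<String, List<String[]>> perBook = new HashMap<String, List<String[]>>();
        BufferedReader in = new BufferedReader(new FileReader("similarities.tsv"));
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.split("\t");
          if (!perBook.containsKey(parts[0])) {
            perBook.put(parts[0], new ArrayList<String[]>());
          }
          perBook.get(parts[0]).add(parts);
        }
        in.close();
        // Write an edge list that Gephi's spreadsheet importer understands.
        PrintWriter out = new PrintWriter("edges.csv");
        out.println("Source,Target,Weight");
        for (List<String[]> links : perBook.values()) {
          Collections.sort(links, new Comparator<String[]>() {
            public int compare(String[] a, String[] b) {
              return Double.compare(Double.parseDouble(b[2]), Double.parseDouble(a[2]));
            }
          });
          for (int i = 0; i < Math.min(topN, links.size()); i++) {
            out.println(links.get(i)[0] + "," + links.get(i)[1] + "," + links.get(i)[2]);
          }
        }
        out.close();
      }
    }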


The first results are shown here. Nodes are books; links are strong similarity measures. Node labels are titles, or sometimes title plus subtitle. Some views (the black-background ones) use Gephi’s “modularity detection” analysis of the link graph; for others (white background) I imported the 1000 clusters from the earlier Mahout experiments. I tried various of Gephi’s metrics and mapped them to node size. This might fairly be called ‘playing around’ at this stage, but there is at least a pipeline from raw data (eventually Linked Data, I hope) through Mahout to Gephi and on to some visual maps of literature.

[Image: 1k_overview]

What does all this show?

That if we can find a way to open up bibliographic datasets, there are solid open-source tools out there that can give new ways of exploring the items described in the data. That those tools (e.g. Mahout, Gephi) provide many different ways of computing similarity, clustering, and presenting. There is no single ‘right answer’ for how to present literature or TV archive content as a visual map, clustering “like with like”, or arranging neighbourhoods. And there is no restriction that we must work dataset-by-dataset, either. Why not use what we know from movie/TV recommendations to arrange the similarity space for books? Or vice-versa?

I must emphasise (to return to Ben Fry’s opening remark) that this is a proof-of-concept. It shows some potential, but it is neither a user interface, nor particularly informative. Gephi as a tool for making such visualizations is powerful, but it too is not a viable interface for navigating TV content. However these tools do give us a glimpse of what is hidden in giant and dull-sounding databases, and some hints for how patterns extracted from these collections could help guide us through literature, TV or more.

Next steps? There are many things that could be tried; more than I could attempt. I’d like to get some variant of these 2D maps onto iPad/Android tablets, loaded with TV content. I’d like to continue exploring the bridges between content (e.g. TED) and library materials, on tablets and PCs. I’d like to look at Mahout’s “collocated terms” extraction tools in more detail. These allow us to pull out recurring phrases (e.g. “Zero Sum”, “climate change”, “golden rule”, “high school”, “black holes” were found in TED transcripts). I’ve also tried extracting bi-gram phrases from book titles using the same utility. Such tools offer some prospect of bulk-creating links not just between single items in collections, but between neighbourhood regions in maps such as those shown here. The cross-links will never be perfect, but then what’s a little serendipity between friends?

As full text access to book data looms, and TV archives are finding their way online, we’ll need to find ways of combining user interface, bibliographic and data science skills if we’re really going to make the most of the treasures that are being shared in the Web. Since I’ve only fragments of each, I’m always drawn back to think of this in terms of collaborative work.

A few years ago, Netflix had the vision and cash to pretty much buy the attention of the entire machine learning community for a measly million dollars. Researchers love to have substantive datasets to work with, and the (now retracted) Netflix dataset is still widely sought after. Without a budget to match Netflix’s, could we still somehow offer prizes to help get such attention directed towards analysis and exploitation of linked TV and library data? We could offer free access to the world’s literature via a global network of libraries? Except everyone gets that for free already. Maybe we don’t need prizes.

Nearby in the Web: NoTube N-Screen, Flickr slideshow

Remote remotes

I’ve just closed the loop on last weekend’s XMPP / Apple Remote hack, using Strophe.js, a library that extends XMPP into normal Web pages. I hope I’ll find some way to use this in the NoTube project (eg. wired up to Web-based video playing in OpenSocial apps), but even if not it has been a useful learning experience. See this screenshot of a live HTML page, receiving and displaying remotely streamed events (green blob: button clicked; grey blob: button released). It doesn’t control any video yet, but you get the idea I hope.

[Screenshot: remote Apple Remote HTML demo, showing a picture of a handheld Apple Remote with a grey blob over the play/pause button, indicating a mouse-up event. Debug text in the HTML reads ButtonUpEvent: PLPZ.]

This webclient needs the JID and password details for an XMPP account, and I think these need to be from the same HTTP server the HTML is published on. It works using BOSH or other tricks, but for now I’ve not delved into those details and options. Source is in the Buttons area of the FOAF svn: webclient. I made a set of images for each button, in combination with button-press (‘down’) and button-release (‘up’) states. I’m running my own ejabberd and using an account ‘buttons@foaf.tv’ on the foaf.tv domain. I also use generic XMPP IM accounts on Google Talk, which work fine, although I read recently that very chatty use of such services can result in data rates being reduced.

To send local Apple Remote events to such a client, you need a bit of code running on an OSX machine. I’ve done this in a mix of C and Ruby: imremoted.c (binary) to talk to the remote, and the script buttonhole_surfer.rb to re-broadcast the events. The ruby code uses Switchboard and by default loads account credentials from ~/.switchboardrc.
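
If you’d rather not read Ruby, the re-broadcasting side boils down to roughly this shape – sketched here in Java with the Smack XMPP library rather than Switchboard, with placeholder account details – just to show how little is involved in pushing a decoded button event out as a chat message:

    import org.jivesoftware.smack.XMPPConnection;
    import org.jivesoftware.smack.packet.Message;

    public class ButtonBroadcaster {
      public static void main(String[] args) throws Exception {
        // Connect as the 'buttons' account and push an event to the Web client's JID.
        XMPPConnection connection = new XMPPConnection("foaf.tv");
        connection.connect();
        connection.login("buttons", "secret");   // placeholder credentials
        Message message = new Message("webclient@foaf.tv", Message.Type.chat);
        message.setBody("ButtonUpEvent: PLPZ");  // e.g. play/pause released
        connection.sendPacket(message);
        connection.disconnect();
      }
    }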

I’ve done a few tests with this setup. It is pretty responsive considering how much indirection is involved, but the demo UI I made could be prettier. The + and – buttons behave differently from left, right, menu and play/pause; only + and – send an event immediately. The others wait until the key is released, then send a pair of events. The other keys, except for play/pause, will also forget what’s happening unless you act quickly. This seems to be a hardware limitation. Apparently Apple are about to ship an updated $20 remote; I hope this aspect of the design is reconsidered, as it limits the UI options for code using these remotes.

I also tried it using two browsers side by side on the same laptop, and two laptops side by side. The events get broadcast just fine. There is a lot more thinking to do regarding serious architecture: where passwords and credentials are stored, and so on. But XMPP continues to look like a very interesting route.

Finally, why would anyone bother installing compiled C code, Ruby (plus XMPP libraries), their own Jabber server, and so on? Well, hopefully the work can be divided up. Not everyone installs a Jabber server. My thinking is that we can bundle a collection of TV and SPARQL XMPP functionality in a single install, such that local remotes can be used on the network, but also so that local software (e.g. XBMC/Plex/Boxee) can be exposed to the wider network – whether it’s XMPP JavaScript running inside a Web page as shown here, or an iPhone, or a multi-touch table. Each will offer different interaction possibilities, but they can chat away to each other using a common link, and common RDF vocabularies (an area we’re working on in NoTube). If some common micro-protocols over XMPP (sending clicks, sending commands, or doing RDF queries) can support compelling functionality, then installing a ‘buttons adaptor’ is something you might do once, with multiple benefits. But for now, apart from the JQbus piece, the protocols are still vapourware. First I wanted to get the basic plumbing in place.

Update: I re-discovered a useful ‘which BOSH server do you need?’ overview, which reminds me that there are basically two kinds of BOSH software offering: those that are built into some existing Jabber/XMPP server (like the ejabberd installation I’m using on foaf.tv), and those that are stand-alone connection managers that proxy traffic into the wider XMPP network. In terms of the current experiment, it means the event stream can (within the XMPP universe) come from addresses other than the host running the Web app. So I should also try installing Punjab. Perhaps it will also let webapps served from other hosts (such as OpenSocial containers) talk to it via JSON tricks? So far I have only managed to serve working Buttons/Strophe HTML from the same host as my Jabber server, i.e. foaf.tv. I’m not sure how feasible the cross-domain option is.

Update x2: Three different people have now mentioned opensoundcontrol to me as something similar, at least on a LAN; it clearly deserves some investigation.

Referata, a Semantic MediaWiki hosting site

From Yaron Koren on the semediawiki-users list:

I’m pleased to announce the release of the site Referata, at referata.com: a hosting site for SMW-based semantic wikis. This is not the first site to offer hosting of wikis using Semantic MediaWiki (that’s Wikia, as of a few months ago), but it is the first to also offer the usage of Semantic Forms, Semantic Drilldown, Semantic Calendar, Semantic Google Maps and some of the other related extensions you’ve probably heard about; Widgets, Header Tabs, etc. As such, I consider it the first site that lets people create true collaborative databases, where many people can work together on a set of well-structured data.

See the announcement and their features page for more details. Basic usage is free; $20/month premium accounts can have private data, and $250/month enterprise accounts can use their own domains. Not a bad plan, I think. A showcase Referata wiki would help people understand the offering better. In the meantime there is elsewhere a list of sites using Semantic MediaWiki. That list omits Chickipedia; we can only wonder why. Also I have my suspicions that Intellipedia runs with the SMW extensions too, but that’s just guessing. Regardless, there are a lot of fun things you could do with this; take a look…

Foundation Nation: new orgs for Infocards, Symbian

Via the [IP] list, I read that the Information Card Foundation has launched.

Information Cards are the new way to control your personal data and identity on the web.

The Information Card Foundation is a group of thoughtful designers, architects, and companies who want to make the digital world easier for you by building better products that help you get control of your personal information.

From their blog, where Charles Andres offers a historical account of where they fit in:

And by early 2007, four tribes in the newly discovered continent of user-centric identity had united under the banner of OpenID 2.0 and brought the liberating power of user-controlled identifiers to the digital identity pioneers. The OpenID community formed the OpenID Foundation to serve as a trustee for intellectual property and a host for community activity and by early 2008 had attracted Microsoft, Yahoo, Google, VeriSign, and IBM to join as corporate directors.

Inspired by these efforts, the growing Information Card community realized that to bring this metaphor to full fruition required taking the same step—coming together into a common organization that would unify our efforts to create an interoperable identity layer. From one perspective this could be looked at as completing the “third leg of the stool” of what is often called the Venn of Identity (SAML, OpenID, and Information Cards). But from another perspective, you can see it as one of the logical steps needed towards the cooperative convergence among identity systems and protocols that will be necessary to reach a ubiquitous Internet identity layer—the layer that completes the hat trick.

I’m curious to see what comes of this. There’s some big backing, and I’ve heard good things about Infocard from folks in the know. From a SemWebby perspective, this stuff just gives us another way to figure out the provenance of claim graphs representable in RDF, queryable in SPARQL. And presumably some more core schemas to play with…

Meanwhile in the mobile scene, a Symbian Foundation has been unveiled:

Industry leaders to unify the Symbian mobile platform and set it free
Foundation to be established to provide royalty-free open platform and accelerate innovation

The demand for converged mobile devices is accelerating. By 2010 we expect four billion people to have joined the global mobile conversation. For many of these people, their mobile will be their first Internet experience, not just their first camera, music player or phone.

Open software is the basic building block for delivering this future.

With this in mind, industry leaders are coming together to establish Symbian Foundation, to bring to life a shared vision and to create the most proven, open and complete mobile software platform – available for free. To achieve this, the foundation will unify Symbian, S60, UIQ and MOAP(S) software to create an unparalleled open software platform for converged mobile devices, enabling the whole mobile ecosystem to accelerate innovation.

The foundation is expected to start operating during the first half of 2009. Membership of the foundation will be open to all organizations, for a low annual membership fee of US $1,500.

I’ll save my pennies for an iPhone. Everybody’s open nowadays, I guess that’s good…

WHY MIGHT CONNECTING WITH ZANDER JULES BE A GOOD IDEA?

Or: towards evidence-based ‘add a contact’ filtering…

This just in from LinkedIn:

Have a question? Zander Jules’s network will probably have an answer
You can use LinkedIn Answers to distribute your professional questions to Zander Jules and your extended network. You can get high-quality answers from experienced professionals.

Zander Jules requested to add you as a connection on LinkedIn:

Dan,

Dear
My name is Zander Jules a Banker and accountant with Bank Atlantique Cote Ivoire.I contacting u for a business transfer of a large sum of money from a dormant account. Though I know that a transaction of this magnitude will make any one apprehensive,
but I am assuring u all will be well at the end of the day.I am the personal accounts manager to Engr Frank Thompson, a National of ur country, who used to work with an oil servicing company here in Cote Ivoire. My client, his wife & their 3 children were involved in the ill fated Kenya Airways crash in the coasts of Abidjan in January 2000 in which all passengers on board died. Since then I have made several inquiries to ur embassy to locate any of my clients extended relatives but has been unsuccessful.After several attempts, I decided to trace his last name via internet,to see if I could locate any member of his
family hence I contacted u.Of particular interest is a huge deposit with our bank in our country,where the deceased has an account valued at about $16 million USD.They have issued me notice to provide the next of kin or our bank will declare the account unservisable and thereby send the funds to the bank treasury.Since I have been unsuccessful in locating the relatives for past 7 yrs now, I will seek ur consent to present you as the next of kin of the deceased since u have the same last names, so that the proceeds of this account valued at $16million USD can be paid to u and then u and I can share the money.All I require is your honest cooperation to enable us see this deal through. I guarantee that this will be executed under all legitimate arrangement that will protect you from any breach of the law. In your reply mail, I want you to give me your full names, address, D.O.B, tel& fax #.If you can handle this with me, reach me for more details.

Thanking u for ur coperation.
Regards,

I’m surprised we’ve not seen more of this, and sooner. YouTube contacts are pretty spammy, and Twitter has also suffered. The other networks are relatively OK so far. But I don’t think they’re anything like as robust as they’ll need to get, particularly since a faked contact can get privileged access to personal details. Definitely an arms race…

A tale of two business models

Glancing back at 1998, thanks to the Wayback Machine.

W3C has Royalty-Free licensing requirements and a public Process Document for good reason. I’ve been clicking back through the paper trails around the W3C P3P vs Intermind patent issue as a reminder. Here is the “Appendix C: W3C Licensing Plan Summary” from the old Intermind site:

We expect to license the patents to practice the P3P standard as it evolves over time as follows:

User Agents: For User Agent functionality, all commercial licensees will pay a royalty of 1% of revenues directly associated with the use of P3P or 0.1% of all revenues directly associated with the product employing the User Agent at the licensee’s option. We expect to measure revenues in a confidential and mutually acceptable manner with the licensee.

Service Providers: For Service Provider functionality, all commercial licensees will pay a royalty of 1% of revenues directly associated with the use of P3P. We expect to measure these revenues through the use of Web site logs, which will determine the percentage of P3P-related traffic and apply that percentage to the relevant Web site revenues (i.e., advertising-based revenues or transaction-based revenues). We expect to determine a method for monitoring or auditing such logs in a confidential and mutually acceptable manner with the licensee.

[...]

Intermind Corporation also expects to license the patents for CDF, ICE, and other XML-based agent technology on non-discriminatory terms. Members interested in further information on the relationship of Intermind’s patents to these technologies can contact Drummond Reed at drummond@intermind.com or 206-812-6000.

Nearby in the Web:

Cover Pages on Extensible Name Service (XNS):

Background: “In January 1999, the first of Intermind’s web agent patents began being issued (starting with U.S. patent No. 5,862,325). At the heart of this patent was a new naming and addressing service based on web agent technology. With the emergence of XML as a new global data interchange language — one perfectly suited to the requirements of a global ‘language’ for web agents — Intermind changed its name to OneName Corporation, built a new board and management team, and embarked on the development of this new global naming and addressing service. Because its use of XML as the foundation for all object representation and interchange led to the platform, yet had the same distributed architecture as DNS, it was christened eXtensible Name Service, or XNS. Recognizing the ultimate impact such a system may have on Internet infrastructure, and the crucial role that privacy, security, and trust must play, OneName also made the commitment to building it with open standards, open source software, and an open independent governance organization. Thus was born the XNS Public Trust Organization (XNSORG), the entity charged with setting the technical, operational, and legal standards for XNS.”

Over on XDI.ORG – Licenses and Agreements:

Summary of the XDI.ORG Intellectual Property Rights Agreement

NOTE: This summary is provided as a convenience to readers and is not intended in any way to modify or substitute for the full text of the agreement.

The purpose of the XDI.ORG IPR Agreement between XDI.ORG and OneName Corporation (dba Cordance) is to facilitate and promote the widespread adoption of XDI infrastructure by transfering the intellectual property rights underlying the XDI technology to a community-governed public trust organization.

The agreement grants XDI.ORG exclusive, worldwide, royalty-free license to a body of patents, trademarks, copyrights, and specifications developed by Cordance on the database linking technology underlying XDI. In turn, it requires that XDI.ORG manage these intellectual property rights in the public interest and make them freely available to the Internet community as royalty-free open standards. (It specifically adopts the definition provided by Bruce Perens which includes the ability for XDI.ORG to protect against “embrace and enhance” strategies.)

There is also a Global Service Provider aspect to this neutral, royalty-free standard. Another excerpt:

Summary of the XDI.ORG Global Service Provider Agreement

NOTE: This summary is provided as a convenience to readers and is not intended in any way to modify or substitute for the full text of the agreement.

Global Services are those XDI services offered by Global Service Providers (GSPs) based on the XRI Global Context Symbols (=, @, +, !) to facilitate interoperability of XDI data interchange among all users/members of the XDI community. XDI.ORG governs the provision of Global Services and has the authority to contract with GSPs to provide them to the XDI community.

For each Global Service, XDI.ORG may contract with a Primary GSP (similar to the operator of a primary nameserver in DNS) and any number of Secondary GSPs (similar to the operator of a secondary DNS nameserver). The Secondary GSPs mirror the Primary for loadbalancing and failover. Together, the Primary and Secondary GSPs operate the infrastructure for each Global Service according to the Global Services Specifications published and maintained by XDI.ORG.

The initial XDI.ORG GSP Agreement is between XDI.ORG and OneName Corporation (dba Cordance). The agreement specifies the rights and obligations of both XDI.ORG and Cordance with regard to developing and operating the first set of Global Services. For each of these services, the overall process is as follows:

  • If Cordance wishes to serve as the Primary GSP for a service, it must develop and contribute an initial Global Service Specification to XDI.ORG.
  • XDI.ORG will then hold a public review of the Global Service Specification and amend it as necessary.
  • Once XDI.ORG approves the Global Service Specification, Cordance must implement it in a commercially reasonable period. If Cordance is not able to implement or operate the service as required by the Global Service Specification, XDI.ORG may contract with another party to be the primary GSP.
  • XDI.ORG may contract with any number of Secondary GSPs.
  • If XDI.ORG desires to commence a new Global Service and Cordance does not elect to develop the Global Service Specification or provide the service, XDI.ORG is free to contract with another party.

The contract has a fifteen year term and covers a specified set of Global Services. Those services are divided into two classes: cost-based and fee-based. Cost-based services will be supplied by Cordance at cost plus 10%. Fee-based services will be supplied by Cordance at annual fees not to exceed maximums specified in the agreement. These fees are the wholesale cost to XDI.ORG; XDI.ORG will then add fees from any other GSPs supplying the service plus its own overhead fee to determine the wholesale price to registrars (registrars then set their own retail prices just as with DNS). Cordance’s wholesale fees are based on a sliding scale by volume and range from U.S. $5.40 down to $3.40 per year for global personal i-names and from U.S. $22.00 down to $13.50 per year for global organizational i-names.

The agreement also ensures all registrants of the original XNS Personal Name Service and Organizational Name Service have the right to convert their original XNS registration into a new XDI.ORG global i-name registration at no charge. This conversion period must last for at least 90 days after the commencement of the new global i-name service.

Over on inames.net, Become an i-broker:

What Is an I-Broker?

From Wikipedia:

“I-brokers are ‘bankers for data’ or ‘ISPs for identity services’–trusted third parties that help people and organizations share private data the same way banks help us exchange funds and ISPs help us exchange email and files.”

I-brokers are the core providers of XRI digital identity infrastructure. They not only provide i-name and i-number registration services, but also they provide i-services: a new layer of digital identity services that help people and business safely interact on the Internet. See the I-Service Directory for the first open-standard i-services that can be offered by any XDI.org-accredited i-broker.

How Does an I-Broker Become XDI.org-Accredited?

Cordance and NeuStar, together with XDI.org, have published a short guide to the process, “Becoming an I-Broker,” which includes all the information necessary to get started. It also includes contact information for the i-broker support teams at both companies.

Download Becoming an I-Broker

In addition the following two zip files contain all the documents needed by an i-broker. The first one contains the i-broker application and agreements, which are appendicies F and I of the XDI.org Global Services Specifications (GSS), located in their complete form at http://gss.xdi.org. The second one contains all the rest of the GSS documents referenced by the application and agreement.

From that downloadable PDF,

What is the application fee?
The standard application fee is USD $2500. However between the GRS opening on June 20th and the start of Digital ID World 2006 on September 11, 2006, you may apply to Cordance to have the fee offset by development and marketing commitments. For more information, contact Cordance or NeuStar at the addresses below.

In the XDI.org GSS wiki,

6.1. Appendix B: Fee Schedules

Taking a look at that, we see:

Global Services Specification V1.0
Appendix B: Fee Schedule
Revised Version: 1 September, 2007

[...]

This and all contributions to XDI.org are open, public, royalty-free specifications licensed under the XDI.org license available at http://www.xdi.org/docref/legal/xdi-org-license.html.

…where I find I can buy a personal i-name for $5/year, or a business i-name for $29.44, as an individual. Or as “Become an i-broker” points out,

For volume pricing, contact Cordance or NeuStar at the addresses below.

Amazon EC2: My 131 cents

Early this afternoon, I got it into my head to try out Amazon’s “Elastic Compute Cloud” (EC2) service. I’m quite impressed.

The bill so far, after some playing around, is 1.31 USD. I’ve had the default “getting started” Fedora Linux box up and running for maybe 7 hours, as well as trying out machine images (AMIs) preconfigured for Virtuoso data spaces, and for Ruby on Rails. Being familiar with neither, I’m impressed by the fact that I can rehydrate a pre-prepared Linux machine configured for these apps, simply by clicking a button in the EC2 Firefox addon. Nevertheless I managed to get bogged down with both toolkits; wasn’t in an RTFM mood, and even Rails can be fiddly if you’re me.

I’ve done nothing very compute- or bandwidth-intensive yet; I’m sure the costs could crank up a bit if I started web crawling and indexing. But it is an impressive setup, and one you can experiment with easily and cheaply, especially if you’ve already got an Amazon account. Also, being billed in USD is always cheering. The whole thing is controlled through Web service interfaces, which are hidden from me since I use either the Firefox addon or else the command-line tools, which work fine on Mac OS X once you’ve set up some paths and environment variables.

I can now just type “ec2-run-instances ami-e2ca2f8b -k sandbox-keypair” or similar (the latter argument identifies a public key to install in the server) to get a new Linux machine set up within a couple of minutes, pre-configured in this case with Virtuoso Data Spaces. And since the whole environment is therefore scriptable, virtual machines can beget virtual machines. Zero-click purchase – just add money :)
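
Since it is all driven from the command line, ‘scriptable’ really is scriptable: a few (hypothetical) lines are enough for one program to spin up further machines by shelling out to the same tool shown above:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class SpawnInstance {
      public static void main(String[] args) throws Exception {
        // Shell out to the EC2 command-line tools, exactly as typed by hand above.
        ProcessBuilder pb = new ProcessBuilder(
            "ec2-run-instances", "ami-e2ca2f8b", "-k", "sandbox-keypair");
        pb.redirectErrorStream(true);
        Process p = pb.start();
        BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = out.readLine()) != null) {
          System.out.println(line);  // instance ID, state, etc., as reported by the tool
        }
        p.waitFor();
      }
    }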

So obviously I’m behind on the trends here, but hey, it’s never too late. The initial motivation for checking out EC2 was a mild frustration with the DreamHost setup I’ve been using for personal and FOAF stuff. No huge complaints, just that I was cheap and signed up for shared-box access, and it’s kind of hard not having root access after 12 years or so of having recourse to superpowers when needed. Also DreamHost don’t do Java, which is mildly annoying. Some old FOAF stuff of Libby’s is in Java, and of course it’d be good to be able to make more use of Jena.

As I type this, I’m sat across the room from an ageing Linux box (with a broken CPU fan), taking up space and bringing a tangle of cables to my rather modest living room. I had been thinking about bringing it back to life as a dev box, since I’m otherwise working only on Mac and WinXP laptops. Having tried EC2, I see no reason to clutter my home with it. I’d like to see a stronger story from Amazon about backing up EC2 instances, but … well, I’d like to see a stronger story from me on that topic too :)

Religious Technology (What’s in a link?)

I’ve just rediscovered this, while tidying: a letter I received last year, from Hodkin And Company, Solicitors:

We understand that you operate, control or manage a website on which one Mr Damien Steer has placed literally hundreds of pages of our clients’ copyrighted works without the authorization of our clients. Because of the enormity and volume of the infringements, we have broken these down under separate headings [...]

Damian removed the documents immediately, publishing a scan of (his version of) the letter in their place. See also Karin Spaink’s pages for more on the Scientology materials concerned. I’ve not followed the twists and turns of the whole thing, but here’s a note from Karin’s page:

This homepage is approved of by court. Twice, by now. It has thereby become the world’s first legal Fishman Homepage. Read the ruling of the February 1996 lawsuit, summary proceedings, in either English or Dutch. On June 10, 1999, there was a second ruling, this time in full procedure: my page can still stay up. Read the ruling in Dutch or in English. Scientology has appealed this ruling. It is not yet known when pleas will be held.

Although I won, there’s one thing that seriously bugs me, and other people. The court ruled that hyperlinks and url’s refering to pages that contain infringing material must in themselves be considered to be infringing. That cuts at the heart of the net. To name one example: it makes search engines illegal: they often refer to pages that contain infringing material.

More from the letter I received…

You should be aware that numerous permanent injunctions and awards of statutory damages and attorney’s fees have been entered regarding similar infringements. For instance, a jury in the United States District Court in San Jose, California awarded statutory damages in the amount of $75,000 against a Mr. Henson for posting only one of the NOTs works on the Internet.

Maybe I’m in the wrong business?