XMPP untethered – serverless messaging in the core?

In the XMPP session at last february’s FOSDEM I gave a brief demo of some NoTube work on how TV-style remote controls might look with XMPP providing their communication link. For the TV part, I showed Boxee, with a tiny Python script exposing some of its localhost HTTP API to the wider network via XMPP. For the client, I have a ‘my first iphone app‘ approximation of a remote control that speaks a vapourware XMPP remote control protocol, “Buttons”.

The point of all this is about breaking open the Web-TV environment, so that different people and groups get to innovate without having to be colleagues or close-nit business partners. Control your Apple TV with your Google Android phone; or your Google TV with your Apple iPad, or your Boxee box with either. Write smart linking and bookmarking and annotation apps that improve TV for all viewers, rather than only those who’ve bought from the same company as you. I guess I managed to communicate something of this because people clapped generously when my iphone app managed to pause Boxee. This post is about how we might get from evocative but toy demos to a useful and usable protocol, and about one of our largest obstacles: XMPP’s focus on server-mediated communications.

So what happened when I hit the ‘pause’ button on the iphone remote app? Well, the app was already connected to the XMPP network, e.g. signed in as bob.notube@gmail.com via Google Talk’s servers. And so an XMPP stanza flowed out from the room we were in, across to Google somewhere, and then via XMPP server-to-server protocol over to my self-run XMPP server (an ejabberd hosted on Amazon EC2’s east USA zone somewhere). And from there, the message returned finally to Brussels, flowing through whichever Python library I was using to Boxee (signed in as buttons@foaf.tv), causing the video to pause. This happened quite quickly, and generally very quickly; but sometimes it can take more than a second. This can be very frustrating, and while there are workaround (keep-alive messages, smart code that ignores sequences of buffered ‘Pause!’ messages, apps that download metadata and bring more UI to the second screen, …), the problem has a simple cause: it just doesn’t make sense for a ‘pause’ message to cross the atlantic twice, and pass through two XMPP servers, on its the way across the living room from remote control to TV.

But first – why are we even using XMPP at all, rather than say HTTP? Partly because XMPP lets us easily address devices on home networks, that aren’t publically exposed as running a Web server. Partly for the symmetry of the protocol, since ipads, touch tables, smart phones, TVs and media centres all can host and play media items on their own displays, and we may have several such devices in a home setting that need to be in touch with one another. There’s also a certain lazyness; XMPP already defines lots of useful pieces, like buddylist rosters, pubsub notifications, group chats; it has an active and friendly community, and it comes with a healthy collection of tools and libraries. My own interests are around exploring and collectively annotating the huge archives of content that are slowly coming online, and an expectation that this could be a more shared experience, so I’m following an intuition that XMPP provides more useful ‘raw materials’ for social content exploration than raw HTTP. That said, many elements of remote control can be defined and implemented in either environment. But for today, I’m concentrating on the XMPP side.

So back at FOSDEM I raised a couple of concerns, as a long-term XMPP well-wisher but non-insider.

The first was that the technology presents itself as a daunting collection of extensions, each of which might or might not be supported in some toolkit. To this someone (likely Dave Cridland) responded with the reassuring observation that most of these could be implemented by 3rd party app developer simply reading/writing XMPP stanzas. And that in fact pretty much the only ‘core’ piece of XMPP that wasn’t treated as core in most toolkits was the serverless, point-to-point XEP-0174 ‘serverless messaging‘ mode. Everything else, the rest of us mortals could hack in application code. For serverless messaging we are left waiting and hoping for the toolkit maintainers to wire things in, as it generally requires fairly intimate knowledge of the relevant XMPP library.

My second point was in fact related: that if XMPP tools offered better support for serverless operation, then it would open up lots of interesting application options. That we certainly need it for the TV remotes use case to be a credible use of XMPP. Beyond TV remotes, there are obvious applications in the area of open, decentralised social networking. The recent buzz around things like StatusNet, GNU Social, Diaspora*, WebID, OneSocialWeb, alongside the old stuff like FOAF, shows serious interest in letting users take more decentralised control of their online social behaviour. Whether the two parties are in the same room on the same LAN, or halfway around the world from each other, XMPP and its huge collection of field-tested, code-supported extensions is relevant, even when those parties prefer to communicate directly rather than via servers.

With XMPP, app party developers have a well-defined framework into which they can drop ad-hoc stanzas of information; whether it’s a vCard or details of recently played music. This seems too useful a system to reserve solely for communications that are mediated by a server. And indeed, XMPP in theory is not tied to servers; the XEP-0174 spec tells us both how to do local-network bonjour-style discovery, and how to layer XMPP on top of any communication channel that allows XML stanzas to flow back and forth.

From the abstract,

This specification defines how to communicate over local or wide-area networks using the principles of zero-configuration networking for endpoint discovery and the syntax of XML streams and XMPP messaging for real-time communication. This method uses DNS-based Service Discovery and Multicast DNS to discover entities that support the protocol, including their IP addresses and preferred ports. Any two entities can then negotiate a serverless connection using XML streams in order to exchange XMPP message and IQ stanzas.

But somehow this remains a niche use of XMPP. Many of the toolkits have some support for it, perhaps as work-in-progress or a patch, but it remains somewhat ‘out there’ rather than core to the XMPP approach. I’d love to see this change in 2011. The 0174 spec combines a few themes; it talks a lot about discovery, motivated in part by trade-fair and conference type scenarios. When your Apple laptop finds people locally on some network to chat with by “Bonjour”, it’s doing more or less XEP-0174. For the TV remote scenario, I’m interested in having nodes from a normal XMPP network drop down and “re-discover” themselves in a hopefully-lower-latency point to point mode (within some LAN or across the Internet, or between NAT-protected home LANs). There are lots of scenarios when having a server in the loop isn’t needed, or adds cost and risk (latency, single point of failure, privacy concerns).

XEP-0174 continues,

6. Initiating an XML Stream
In order to exchange serverless messages, the initiator and
recipient MUST first establish XML streams between themselves,
as is familiar from RFC 3920.
First, the initiator opens a TCP connection at the IP address
and port discovered via the DNS lookup for an entity and opens
an XML stream to the recipient, which SHOULD include 'to' and
'from' address. [...]

This sounds pretty precise; point-to-point communication is over TCP.  The Security Considerations section discussed some of the different constraints for XMPP in serverless mode, and states that …

To secure communications between serverless entities, it is RECOMMENDED to negotiate the use of TLS and SASL for the XML stream as described in RFC 3920

Having stumbled across Datagram TLS (wikipedia, design writeup), I wonder whether that might also be an option for the layer providing the XML stream between entities.  For example, the chownat tool shows a UDP-based trick for establishing bidirectional communication between entities, even when they’re both behind NAT. I can’t help but wonder whether XMPP could be layered somehow on top of that (OpenSSL libraries have Datagram TLS support already, apparently). There are also other mechanisms I’ve been discussing with Mo McRoberts and Libby Miller lately, e.g. Mo’s dynamic dns / pubkeys idea, or his trick of running an XMPP server in the home, and opening it up via UPnP. But that’s for another time.

So back on my main theme: XMPP is holding itself back by always emphasising the server-mediated role. XEP-0174 has the feel of an afterthought rather than a core part of what the XMPP community offers to the wider technology scene, and the support for it in toolkits lags similarly. I’d love to hear from ‘live and breath XMPP’ folk what exactly they think is needed before it can become a more central part of the XMPP world.

From the TV remotes use case we have a few constraints, such as the need to associate identities established in different environments (eg. via public key). If xmpp:danbri-ipad@danbri.org is already on the server-based XMPP roster of xmpp:nevali-tv@nevali.net, can pubkey info in their XMPP vCards be used to help re-establish trusted communications when the devices find themselves connected in the same LAN? It seems just plain nuts to have a remote control communicate with another box in the same room via transatlantic links through Google Talk and Amazon EC2, and yet that’s the general pattern of normal XMPP communications. What would it take to have more out-of-the-box support for XEP-0174 from the XMPP toolkits? Some combination of beer, money, or a shared sense that this is worth doing and that XMPP has huge potential beyond the server-based communications model it grew from?

FOAF vocabulary management details

I have an action from the Vocabulary Management task force of the W3C SW Best Practices Working Group to document the way the FOAF namespace currently works. Here goes.

The FOAF RDF vocabulary is named by the URI http://xmlns.com/foaf/0.1/ and has information available at that URI in both machine-friendly and human-friendly form. When that URI is dereferenced by HTTP, the default representation is currently HTML-based.

Aside: I say based because the markup, while basically being XHTML, also includes incline RDF/XML describing the FOAF vocabulary. The HTML TF of SWBPD WG is looking at ways of making such documents validatable; I hope a future version will be valid in some way, as well as presentable in mainstream browsers. This might be via RDF/A in XHTML2, RDF/A back-ported to an early XHTML variant, a GRDDL-based transform, or perhaps a CDF document if they’re deployable in legacy browsers.

Backing up to the big picture: we want to make it easy for both people and machines to find out what they need. So the basic idea is that there’s an HTML document describing FOAF for humans, and a set of RDF statements describing FOAF for machines. The RDF statements are available in several ways. As embedded RDF/XML in the HTML page. As a separate index.rdf document (this should be LINK REL’d from the former), and as a content-negotiable representation of the main URI, available to clients that send an “Accept: application/rdf+xml” header.

In addition to this, each FOAF term (assuming the Apache .htaccess is up to date; this might not be currently true) is redirected by the Web server to the main FOAF namespace URI. For example, http://xmlns.com/foaf/0.1/Person. This should in the future be done with a 303 HTTP redirection code, to be consistent with the recent W3C TAG decision on http-range-14 (apologies for the jargon and lack of links to explanation). Because FOAF term URIs don’t contain a # [edit: I wrote / previously; thanks mortenf], note that the redirection can’t point down into a sub-section of the HTML document. To make the HTML document more usable, we should probably put a better table of contents for terms nearer to the top of the page, to allow someone to navigate quickly to the description of that term.

One last point: each term defined in FOAF is accompanied not only by a label and comment (standard from RDFS); it also has a chunk of HTML markup, with cross-references to other terms. At the moment, this is not made available in any machine-readable form. There might be some scope here for common practice with SKOS and other vocabularies?

Sorry for the hasty writeup, I just wanted to get this recorded as a starting point…