MAMP / MySQL config notes for ‘Repair with keycache’ and table metadata lock

Problem: MySQL taking forever to load some large data dumps. Forever or longer.

“mysql> show processlist;” shows it wedged at “Repair with keycache” and “Waiting for table metadata lock”.

According to a handy Stack Overflow article, this is a known and dreaded condition. It can be addressed by making sure the tmp dir has plenty of space, and by increasing myisam_max_sort_file_size from 2G (2146435072) to 30G (32212254720). Using MAMP 1.9.6 it took some more digging to find out how to add a local my.cnf settings file for MySQL. This now lives in /Applications/MAMP/conf/my.cnf; I added a line to the [mysqld] section saying ‘myisam_max_sort_file_size = 30G’ (or thereabouts). Shut down the MySQL server, create that my.cnf, and restart; then confirm it read your config using ‘show variables’.
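
For my future self, here’s roughly what that looks like – the my.cnf addition plus the sanity check (the tmpdir line is optional and the path here is only an example; point it somewhere with plenty of room):

[mysqld]
# let MyISAM repair-by-sort use big sort files instead of falling back to the key cache
myisam_max_sort_file_size = 30G
# optional: a temp dir with plenty of free space
tmpdir = /Applications/MAMP/tmp/mysql

mysql> show variables like 'myisam_max_sort_file_size';

If the server re-read the config, that should report 32212254720.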

Does this work? Well, I don’t know yet. But I’ve searched around and found my own notes often enough before that I thought I should at least write this much down for my future self to find :)

Update: it worked. A data import that took 2+ weeks (before I gave up) now runs in a few hours. After the bulk of the data was imported, we see ‘Repair by sorting’ in ‘show processlist’ for a while (couple of hours for 15 million records, in my case). This is, as promised, faster than ‘Repair with keycache’. I’ve done this on two machines now (with the same data); on one of them I did notice some ‘Waiting for table metadata lock’ processes in the list, but it still successfully completed overnight.

Be your own twitter: laconi.ca microblog platform and identi.ca

The laconi.ca microblogging platform is as open as you could hope for. That elusive trinity: open source; open standards; and open content.

The project is led by Evan Prodromou (evan) of Wikitravel fame, whose company just launched identi.ca, “an open microblogging service” built with Laconica. Identi.ca and Laconica are fast gaining feature-parity with Twitter: yesterday we got a “replies” tab; this morning I woke to find “search” working. Plenty of interesting people have signed up and grabbed usernames. Twitter-compatible tools are emerging.

At first glance this might look like the typical “clone” efforts that spring up whenever a much-loved site gets overloaded. Identi.ca’s success is certainly related to the scaling problems at Twitter, but it’s much more important than that. Looking at FriendFeed comments about identi.ca has sometimes been a little depressing: there is too often a jaded, selfish “why is this worth my attention?” tone. But they’re missing something. Dave Winer wrote a “how to think about identi.ca” post recently; worth a read, as is the ever-wise Edd Dumbill on “Why identica is important”. This project deserves your attention if you value Twitter, or if you care about a standards-based decentralised Social Web.

I have a testbed copy at foaf2foaf.org (I’ve been collecting notes for Laconica installations at Dreamhost). It is also federated. While there is support for XMPP (an IM interface), the main federation mechanism is based on HTTP and OAuth, using the openmicroblogging.org spec. Laconica supports OpenID so you can play without needing another password. But the OpenID usage can also help with federation and account matching across the network.

Laconica (and the identi.ca install) supports FOAF by providing a FOAF file for each user – data that is already being indexed by Google’s Social Graph API. For example, see my identi.ca FOAF, and a search of Google SGAPI for my identi.ca account. It is in PHP (and MySQL), so hacking on FOAF consumer code using ARC is a natural step. If anyone is interested to help with that, talk to me and to Evan (and to Bengee of course).

Laconica encourages everyone to apply a clear license to their microblogged posts; the initial install suggests Creative Commons Attribution 3.0. Other options will be added. This is important, both to ensure the integrity of a system where posts can be reliably federated, and as part of a general drift towards the opening up of the Web.

Imagine you are, for example, a major media content owner, with tens of thousands of audio, video, or document files. You want to know what the public are saying about your stuff, in all these scattered distributed Social Web systems. That is just about do-able. But then you want to know what you can do with these aggregated comments. Can you include them on your site? Horrible problem! Who really wrote them? What rights have they granted? The OpenID/CC combination suggests a path by which comments can find their way back to the original publishers of the content being discussed.

I’ve been posting a fair bit lately about OAuth, which I suspect may be even more important than OpenID over the next couple of years. OAuth is an under-appreciated technology piece, so I’m glad to see it being used nicely for Laconica. Laconica installations allow you to subscribe to an account from another account elsewhere in the Web. For example, if I am logged into my testbed site at http://foaf2foaf.org/bandri and I visit http://identi.ca/libby, I’ll get an option to (remote-)subscribe. There are bugs and usability problems as of right now, but the approach makes sense: by providing the URL of the remote account, identi.ca can bounce me over to foaf2foaf, which will ask “really want to subscribe to Libby? [y/n]”, setting up API permissioning for cross-site data flow behind the scenes.

I doubt that the openmicroblogging spec will be the last word on this kind of syndication / federation. But it is progress, practical and moving fast. A close cousin of this design is the work from the SMOB (Semantic Microblogging) project, who use SIOC, FOAF and HTTP. I’m happy to see a conversation already underway about bridging those systems.

Do please consider supporting the project. And a special note for Semantic Web (over)enthusiasts: don’t just show up and demand new RDF-related features. Either build them yourself or dive into the project as a whole. Have a nose around the buglist. There is of course plenty of scope for semwebbery, but I suggest a first priority ought to be to help the project reach a point of general usability and adoption. I’ve nothing against Twitter just as I had nothing at all against Six Apart and Movable Type, back before they opensourced. On the contrary, Movable Type was a great product from great people. But the freedoms and flexibility that opensource buys us are hard to ignore. And so I use WordPress now, having migrated like countless others. My suspicion is we’re at a “WordPress/MovableType” moment here with Identica/Laconica and Twitter, and that of all the platforms jostling to be the “new twitter”, this one is most deserving of success. With opensource, Laconica can be the new Laconica…

You can follow me here: identi.ca/danbri

Commandline PHP for loading RDF URLs into ARC (and Twinkle for query UI)


#!/usr/bin/php
<?php
if ($argc != 2 || in_array($argv[1], array('--help', '-help', '-h', '-?'))) {
?>
This is a command line PHP script with one option: URL of RDF document to load
<?php
} else {

  $supersecret = "123rememberme"; # Security analysts recommend using date of birth + social security ID here
  # *** be careful with real mysql passwords ***

  include_once("../arc/ARC2.php");
  $config = array('db_host' => 'localhost', 'db_name' => 'sg1', 'db_user' => 'sparql',
    'db_pwd' => $supersecret, 'store_name' => 'crawl');
  $store = ARC2::getStore($config);
  if (!$store->isSetUp()) { $store->setUp(); } # create the MySQL tables on first run
  $profile = $argv[1];
  echo "Loading data from " . $profile . "\n";
  $store->query('DELETE FROM <' . $profile . '>'); # drop any earlier copy of this named graph
  $store->query('LOAD <' . $profile . '>');        # fetch the URL and store it as a named graph
}
?>

FWIW, this is what I’m using to (re)load data into an ARC store from the commandline. I’ll try wiring up my old RDF crawler to this when I get time. Each loaded source is stored as a named graph, with the URI it is loaded from being the named graph URI. ARC just needs the path to the unpacked PHP libraries and connection details for a MySQL database; it comes with a handy SPARQL endpoint script too, which I’ve been testing with Twinkle.
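
For the record, running it looks something like this (the filename is whatever you saved the script as – mine here is hypothetical):

chmod +x arc_load.php
./arc_load.php http://danbri.org/foaf.rdf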

My public sandbox data is currently loaded up as follows. No promises it’ll stay there, but anyway, adding the following to Twinkle 2.0’s config file section for SPARQL endpoints works for me. The endpoint also offers a basic Web interface, with HTML, XML, JSON etc. output.


<http://sandbox.foaf-project.org/2008/foaf/ggg.php>
a sources:Endpoint; rdfs:label "FOAF Social Graph Aggregator sandbox".
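
A quick way to check the endpoint is alive from Twinkle (or the Web form) is to list the named graphs in the store; this is a generic SPARQL query, nothing endpoint-specific:

SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } }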

OpenID plugin for WordPress

I’ve just installed Alan J Castonguay’s WordPress OpenID plugin on my blog, part of a cleanup that included nuking 11000+ comments in the moderation queue using the Spam Karma 2 plugin. Apologies if I zapped any real comments too. There are a few left, at least!

The OpenID thing appears to “just work”. By which I mean, I could log in via it and leave a comment. I’d be super-grateful if those of you with OpenIDs could take a minute to leave a comment on this post, to see if it works as well as it seems to. If it doesn’t, a bug report (to danbrickley@gmail.com) would be much appreciated. Those of you with LiveJournals or AOL/AIM accounts already have OpenID, even if you didn’t notice. See the HTML source for my homepage to see how I use “danbri.org” as an OpenID while delegating the hard work to LiveJournal. For more on OpenID, check out these tutorial slides (flash/pdf) from Simon Willison and David Recordon.
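
For reference, the delegation trick is just two link elements in the page head. Mine looks roughly like this (quoted from memory, so double-check the live page source, especially the LiveJournal endpoint URL):

<link rel="openid.server" href="http://www.livejournal.com/openid/server.bml" />
<link rel="openid.delegate" href="http://danbri.livejournal.com/" />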

Thinking about OpenID-mediated blog comments, the tempting thing then would be to do something with the accumulated URIs. The plugin keeps its data in nice SQL tables, presumably accessible to other WordPress plugins. It’s been a while since I made a WordPress plugin, but WordPress seems to offer plugins a pretty good framework these days.

mysql> select user_id, url from wp_openid_identities;
+---------+--------------------+
| user_id | url                |
+---------+--------------------+
|      46 | http://danbri.org/ |
+---------+--------------------+
1 row in set (0.28 sec)

At the moment, it’s just me. It’d be fun to try scooping up RDF (FOAF, SKOS, SIOC, feeds…) from any OpenID URIs that accumulate there. Hmm I even wrote up that project idea a while back – SparqlPress. At the time I tried prototyping it in Redland + PHP, but nowadays I’d probably use Benjamin Nowack’s ARC library, which provides SPARQL query of a MySQL-backed RDF store, and is written in PHP. This gives it the same dependencies as WordPress, making it ideal for pluginization. If anyone’s looking for a modest-sized practical SemWeb project to hack on, that one could be a lot of fun.
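
To make that concrete, a rough sketch of the scooping idea, reusing the store setup from my loader script above (paths and credentials are illustrative, and this assumes it runs inside WordPress, where $wpdb is available):

<?php
global $wpdb; # WordPress's database handle
include_once("../arc/ARC2.php");

$config = array('db_host' => 'localhost', 'db_name' => 'sg1', 'db_user' => 'sparql',
  'db_pwd' => '123rememberme', 'store_name' => 'crawl');
$store = ARC2::getStore($config);
if (!$store->isSetUp()) { $store->setUp(); }

# each commenter's OpenID URL becomes a named graph, as in the loader script
$urls = $wpdb->get_col("SELECT url FROM wp_openid_identities");
foreach ($urls as $url) {
  $store->query('DELETE FROM <' . $url . '>');
  $store->query('LOAD <' . $url . '>');
}
?>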

There’s a lot of interesting and creative fuss about “social networking” site interop around lately, largely thanks to the social graph paper from Brad Fitzpatrick and David Recordon. I lean towards the “show me, don’t tell me” approach regarding buddylists and suchlike (as does Julian Bond with Ecademy), which is why FOAF has only ever had the mild-mannered “knows” relationship in the core vocabulary, rather than trying to over-formalise “bestest friend EVER” and other teenisms. So what I like about this WordPress plugin is that it gives some evidence-based raw material for decentralised social networking apps. Blog comments don’t tell the whole story; nothing tells the whole story. But rather than maintain a FOAF “knows” list (or blogroll, or blog-reader config) by hand, I’d prefer to be able to partially automate it by querying information about whose blogs I’ve commented on, and vice-versa. There’s a lot that could be built – intimidatingly much – and it’s hard to know where to start. I suggest that everyone in the SemWeb scene having an OpenID with a FOAF file linked from it would be an interesting platform from which to start exploring…

Meanwhile, I’ll try generating an RDF blogroll from any URIs that show up in my OpenID WordPress table, so I can build a planetplanet or chumpologica configuration automatically…

GIS and Spatial Extensions with MySQL

GIS and Spatial Extensions with MySQL.

MySQL 4.1 introduces spatial functionality. The article describes some uses of spatial extensions in a relational database, how they can be implemented, what features are present in MySQL, and some simple examples.

I’m hoping to understand the commonalities between this and PostGIS. PostGIS follows the OpenGIS “Simple Features Specification for SQL”. As do the MySQL extensions, apparently. The MySQL pages summarise the extensions as follows:

Data types. There need to be data types to store the GIS information. This is best illustrated with an example: a POINT in a 2-dimensional system.

Operations. There must be additional operators to support the management of multi-dimensional objects. Again, this is best illustrated with an example: a function that computes the AREA of a polygon of any shape.

The ability to input and output GIS data. To make systems interoperable, OGC has specified how the contents of GIS objects are represented in binary and text format.

Indexing of spatial data. To use the different operators, some means of indexing GIS data is needed – in technical terms, spatial indexing.
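
Pulling those four together, a minimal sketch in MySQL 4.1-flavoured SQL (table and coordinates invented for illustration; spatial indexes at this point need MyISAM and a NOT NULL geometry column):

CREATE TABLE places (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100),
  loc POINT NOT NULL,  # a GIS data type
  SPATIAL INDEX(loc)   # spatial indexing
) ENGINE=MyISAM;

# input via the OGC Well-Known Text format
INSERT INTO places (name, loc)
  VALUES ('Bristol', GeomFromText('POINT(-2.58 51.45)'));

# an operation: AREA of an arbitrary polygon
SELECT Area(GeomFromText('POLYGON((0 0, 4 0, 4 3, 0 3, 0 0))'));

# output back as text, filtered with a bounding-box test
SELECT name, AsText(loc) FROM places
  WHERE MBRContains(GeomFromText('POLYGON((-3 51, -2 51, -2 52, -3 52, -3 51))'), loc);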

I’m currently working on some ideas to prototype a new project (to fill the gap that the completion of SWAD-Europe leaves in my schedule). I’ll be revisiting my Gargonza plan to add a basic SemWeb RDF crawler to personal weblog installations, initially prototyping with Redland addons to WordPress. Ultimately, pure PHP would be better, unless Redland finds its way into the default PHP installation. Since WordPress requires MySQL anyway, it seems worth taking a look at these geo-related extensions. A more thorough investigation would take a look at reflecting GIS SQL concepts into RDF, perhaps exposing them in a SPARQL query environment. But that’s a bit ambitious for now.

What I hope to do for starters is use a blog as a personal SW crawler, scooping up RSS, FOAF, calendar, and photo descriptions from nearby Web sites. It isn’t clear yet exactly how photo metadata should most usefully be structured, but it is clear that we’ll find a way to harvest it into an RDF store. And if that metadata has mappable content, whether basic lat/long tags, richer GML, or something in between, we’ll harvest that too. My working hypothesis is that we’ll need something like MySQL spatial extensions or PostGIS to really make the most of that data, e.g. to expose location-specific, app-centric RSS, KML, etc. feeds such as those available from the flickr-derived geobloggers.com and brainoff flickr.proxy sites. See mapufacture.com for one possible client app; Google Earth as KML browser is another.

That’s the plan anyway. So the reading list grows. Fortunately, OGC’s GIS SQL spec at least has some nice diagrams…

GIS datatype hierarchy

Missed me, huh?

Blog was broken for weeks. Note to self for next time:

MySQL password forgettage: restart it with --skip-grant-tables
and run “mysql -u root mysql”

Then…

UPDATE user SET Password=PASSWORD('sekretsekret') WHERE User='root';

…and before all this, have backed it all up.

REPAIR TABLE wordpress.wp_categories; …etc to fix up trashed tables. Then restart the mysqld with normal settings.
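
For even-lazier future-me, the whole dance in one place (the mysqld_safe invocation varies by install; FLUSH PRIVILEGES saves a second restart after the password change):

# stop the running server first, then:
mysqld_safe --skip-grant-tables &
mysql -u root mysql

UPDATE user SET Password=PASSWORD('sekretsekret') WHERE User='root';
FLUSH PRIVILEGES;
REPAIR TABLE wordpress.wp_categories;

# finally, restart mysqld with normal settings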

This, in case it wasn’t clear, is a Test Post.