“The Net interprets censorship as damage and routes around it.” John Gilmore, Time Magazine 6th December 1993

This quote – made almost 16 years ago – sums up in a nutshell why I love the internet sometimes.

As is obvious from the ongoing events this morning, the law firm Carter-Ruck didn’t really understand just how badly it was going to shoot itself in the foot when it gagged the Guardian newspaper in an attempt to prevent it from reporting on open questions asked in Parliament.

These questions referred to the Minton Report regarding illegal toxic waste dumping.

I guess we should really thank them, because had they not done so I wouldn’t have this delicious feeling of schadenfreude as thousands of people find out about their client Trafigura illegally dumping toxic waste off the Ivory Coast, in possibly the largest toxic waste scandal of the 21st century.

The story broke this morning and has been widely circulated around blogs and Twitter, passed around like a note in a giant electronic classroom. (Interestingly, at the time of writing at least, the BBC have not picked up the story. Make of that what you will.)

The internet is people (as my esteemed friend says so often), and when people are connected secrets become much harder to keep, and cover-ups much harder to orchestrate.

People power ftw.

Update: The gag order on the Guardian was lifted shortly before they were due to appear in the High Court.

Could the shitestorm generated possibly have something to do with it…?

Or maybe not.

Back in July I gave a talk at Oxford Geek Nights entitled “#DigitalBritain fail”, in which I discussed the Digital Britain report and some of its many shortcomings.

One of the potential courses of action I suggested was for people to essentially smile, say “that’s nice, dear” and continue innovating: to take the typically open source approach adopted by the guys at OpenStreetMap (among others) and recreate proprietary datasets in the public domain.

I was therefore delighted when I came across the guys at Ernest Marples, who were attempting to provide a free version of the postcode-to-location database.

As a bit of background: in the UK the state (via Royal Mail Holdings, of which the state is the sole shareholder) has a monopoly on all postcode-to-location lookups. This monopoly is protected by crown copyright and a royal charter, which basically means that even though the dataset was produced using taxpayers’ money it is owned by the crown (in the case of crown copyright), and the charter means that nobody else is permitted to provide the same service.

This means that in order to do anything with postcodes you need to pay a licence fee to the post office, pricing the small players out of the game or limiting them to using a service provider such as Yahoo (which has its own terms of use). A similar situation exists for geolocation in general, but in that instance you have to pay the Ordnance Survey.

This situation is archaic and was a hot topic at Barcamp Transparency. Data produced with taxpayers’ money should be freely available to all, and I had hoped that the dissolution of crown copyright would have been one of the first things the Digital Britain report recommended.

Yesterday, Ernest Marples announced on their blog that they were shutting down their service in the face of a legal challenge from Royal Mail, who pretty much accused them of stealing their database. Although the Ernest Marples guys were a little cagey about where they got their data (with hindsight, this was probably a mistake), they did explicitly state that their service was not using the Royal Mail database in any way.

Under the terms of the charter, however, they are simply not permitted to provide this service in competition with Royal Mail, and this is the basis of the legal challenge.

I am saddened to see this promising project go, and especially sorry to see that they don’t have the funds to get their day in court. A court case of this nature could provide a useful forum to hold a long overdue debate as to the relevance of the charter and crown copyright in general in the twenty-first century.

Crown copyright is a problem (as well as being morally dubious), and a monopoly is always bad (especially when state-enforced). It is sad to see promising UK innovation stifled by entrenched interests, but it seems to be a recurring theme in modern Britain. As we have just seen, it puts severe limits on just how far a project can go in opening up and recreating data sets, and this worries me.

I wish the project and its organisers all the best for the future.

Top image “postbox_20may2009_0830” by Patrick H. Lauke

I have recently been exploring some aspects of the Elgg scalability question by looking at how easy it would be to get the latest version of Elgg (1.6) running on a MySQL cluster.

In this article I will document the process, but first I should point out:

  • This is highly experimental and not endorsed in any way.
  • It is built against Elgg 1.6.1
  • This is not canonical and doesn’t reflect anything to do with the roadmap
  • This has not been extensively tested so caveat emptor.

Setting up the cluster

The first step is to set up the cluster on your equipment.

A MySQL cluster consists of a management node and several data nodes connected together by a network. Because I was running rather low on hardware, I cheated here and created each node as a VirtualBox image on my laptop – but the principle is the same.

Each node is an Ubuntu install (although you can use pretty much any OS) with two (virtual) network cards, one connected to the wider network (to install packages) and another on an internal network. If you do this for real you should consider removing the internet-facing card once you’ve set everything up, since a cluster isn’t secure enough to be run on the wider internet.

In my test configuration I had three nodes with name/internal IP as follows:

  • HHCluster1/192.168.2.1 – Management node & web server
  • HHCluster2/192.168.2.2 – First data node
  • HHCluster3/192.168.2.3 – Second data node

HHCluster1 – The management node

Install MySQL, Apache etc. This should be a simple matter of apt-getting the relevant packages. Clustering (NDB) support is built into the version of MySQL bundled with Ubuntu, but this may not be the case universally, so check!
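For example, something along these lines should pull in everything needed; the exact package names are those of the Ubuntu release I was using, so treat this as a rough sketch rather than gospel:

sudo apt-get update
# MySQL server (which bundles the NDB daemons on this release), plus Apache and PHP for Elgg
sudo apt-get install mysql-server apache2 libapache2-mod-php5 php5-mysql php5-gd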

You need to create a file in /etc/mysql/ called ndb_mgmd.cnf; it should contain the following:


[NDBD DEFAULT]
NoOfReplicas=2 # Number of replicas of each table fragment (matches our two data nodes)
DataMemory=80M # How much memory to allocate for data storage (change for larger clusters)
IndexMemory=18M # How much memory to allocate for index storage (change for larger clusters)
[MYSQLD DEFAULT]
[NDB_MGMD DEFAULT]
[TCP DEFAULT]

[NDB_MGMD]
HostName=192.168.2.1 # IP address of this system

# Now we describe each node on the system

# First data node
[NDBD]
HostName=192.168.2.2
DataDir=/var/lib/mysql-cluster
BackupDataDir=/var/lib/mysql-cluster/backup
DataMemory=512M

# Second data node
[NDBD]
HostName=192.168.2.3
DataDir=/var/lib/mysql-cluster
BackupDataDir=/var/lib/mysql-cluster/backup
DataMemory=512M

#one [MYSQLD] per data storage node
[MYSQLD]
[MYSQLD]

Data nodes (HHCluster2 & 3)
You must now configure your data nodes:

  1. Create the data directories; as root, type:

    mkdir -p /var/lib/mysql-cluster/backup
    chown -R mysql:mysql /var/lib/mysql-cluster

  2. Edit your /etc/mysql/my.cnf and add the following to the [mysqld] section:

    ndbcluster
    # Replace the following with the IP address of your management server
    ndb-connectstring=192.168.2.1

  3. Again in /etc/mysql/my.cnf uncomment and edit the [MYSQL_CLUSTER] section so it contains the location of your management server:

    [MYSQL_CLUSTER]
    ndb-connectstring=192.168.2.1

  4. You need to create your database on each node (this is because clustering operates on a table level rather than a database level):

    CREATE DATABASE elggcluster;

Starting the cluster

  1. Start the management node:

    /etc/init.d/mysql-ndb-mgm start

  2. Start your data nodes:

    /etc/init.d/mysql restart
    /etc/init.d/mysql-ndb restart

Verifying the cluster
You should now have the cluster up and running; you can verify this by logging into your management node and typing show in ndb_mgm.
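Something like the following should report the management node and both data nodes as connected (the exact output varies between MySQL versions, so take this as a sketch):

# on HHCluster1
ndb_mgm -e show
# or run ndb_mgm interactively and type "show" at the prompt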

A word on access…

The cluster is now set up and will replicate tables (those created with the ndbcluster engine – more on that later), but that is only useful up to a point. Right now we don’t have a single endpoint to direct queries to, so queries need to be routed to the right node at the application level.

We could take advantage of Elgg’s built-in split reads and writes, but this would only allow us to use a maximum of two nodes. A better solution would be to use a load balancer such as Ultramonkey to direct each query to the appropriate server, allowing us to scale much further.

I didn’t really have time to get into this, so I am using the somewhat simpler mysql-proxy.

  1. On HHCluster1 install and run mysql-proxy:

    apt-get install mysql-proxy
    mysql-proxy --proxy-backend-addresses=192.168.2.2:3306 --proxy-backend-addresses=192.168.2.3:3306

  2. On your data nodes edit your /etc/mysql/my.cnf file. Find bind-address and change its IP to the node’s IP address. Also ensure that you have commented out any occurrence of skip-networking (see the example after this list).
  3. Again on your data nodes, log in to mysql and grant access to your cluster database to a user on HHCluster1 – for example:

    GRANT ALL ON elggcluster.* TO `root`@`HHCluster1.local` IDENTIFIED BY '[some password]'
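For reference (step 2 above), the relevant lines of /etc/mysql/my.cnf on a data node might end up looking something like this; the address shown is HHCluster2’s, so adjust per node:

[mysqld]
# bind to this node's own internal address (192.168.2.3 on HHCluster3)
bind-address = 192.168.2.2
# make sure any skip-networking line stays commented out:
# skip-networking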

Installing Elgg

Unfortunately as it stands, you need to make some code changes to the vanilla version of Elgg in order for it to work in a clustered environment. These changes are necessary because of the restrictions placed on us by the ndbcluster engine.

Two things in particular cause us problems – ndbcluster doesn’t support FULLTEXT indexes, and it also doesn’t support indexes over TEXT or BLOB fields.

FULLTEXT indexes are used for searching and are largely unused in a vanilla install of Elgg, so I removed them. Equally, one can live without most of the indexes over blobs, the exception being the one on the metastrings table.
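By way of illustration only (the table is from the stock Elgg schema, but the index name here is hypothetical; check SHOW CREATE TABLE for the real ones), converting an affected table looks something like:

-- drop the FULLTEXT index, then switch the table to the clustered engine
ALTER TABLE prefix_objects_entity DROP INDEX title;
ALTER TABLE prefix_objects_entity ENGINE=ndbcluster;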

Metastrings is accessed a lot, so the index is critical. Therefore I added an extra varchar field, which we’ll modify the code to populate with the first 50 characters of the indexed text – this is equivalent to the existing index:

CREATE TABLE `prefix_metastrings` (
`id` int(11) NOT NULL auto_increment,
`string` TEXT NOT NULL,
`string_index` varchar(50) NOT NULL,
PRIMARY KEY (`id`),
KEY `string_index` (`string_index`)
) ENGINE=ndbcluster DEFAULT CHARSET=utf8;

And the modified query:

$row = get_data_row("SELECT * from {$CONFIG->dbprefix}metastrings where string=$cs'$string' and string_index='$string_index' limit 1");

MySQL’s optimiser checks the index first, so this doesn’t lose a significant amount of efficiency (at least according to the EXPLAIN command).
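For completeness, the insert side has to keep the two columns in step; in the code this means truncating the string to 50 characters before writing, which in SQL terms amounts to something like:

INSERT INTO prefix_metastrings (string, string_index)
VALUES ('some metadata value', LEFT('some metadata value', 50));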

» Modified schema

The next problem is that the system log currently uses INSERT DELAYED to insert the log data. This is also not supported under the clustered engine.

There are a number of approaches we could take, including using Elgg’s delayed write functionality or writing a plugin which replaces the system log and logs to a different location.

For the purposes of this test I decided to just comment out the code in system_log().

What won’t work
Currently there are a couple of core things that won’t work with these changes; here is a by no means complete summary:

  • The system log (as previously described). This isn’t too much of a show stopper as the river code introduced in Elgg 1.5 no longer uses this.
  • The log rotate plugin, as this attempts to copy the table into the ARCHIVE engine type and we can’t guarantee which node it will be executed on in this scenario.
  • Any third party plugins which attempt to access the metastrings table directly (of which there should be none, as direct table access is a big no-no!)

Anyway, here is a patch against the released version of 1.6.1 containing all the code changes I made. Once you have applied this patch to your Elgg install you should be able to proceed with the normal Elgg installation.
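Applying it is the usual patch routine; assuming the file is saved as elgg-1.6.1-cluster.patch (an illustrative name) in the root of your Elgg 1.6.1 directory:

cd /path/to/elgg-1.6.1
patch -p0 < elgg-1.6.1-cluster.patch   # use -p1 instead if the patch paths carry a/ and b/ prefixes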

Let me know any feedback you may have!

» Elgg Clustering patch for Elgg 1.6.1

Top image “Birds-eye view of the 10,240-processor SGI Altix supercomputer housed at the NASA Advanced Supercomputing facility.”