StatsD, created by Etsy, is a simple Node.JS server that provides powerful way to collect numerous statistics about your web application, and to do so simply and quickly.

At Etsy, they graph everything, which they use to great effect.

Collecting statistics is cool because it gives you hard data about how your software is performing. This in turn gives you powerful analytical tools to dig deep into your code, and find out what’s really going on (especially true when combined with a visualisation tool such as Graphite). It let you see the impact of code or infrastructure changes, and to quickly identify problems. Perhaps most importantly, it is only when armed kind of hard data that you can even begin to grasp at the nettles of code optimisation, growth and scalability.

Introducing StatsD for Elgg

With this in mind, and because I needed to collect some stats for a couple of client projects, I’ve put up on Github the first version of an Elgg statsD module. When installed and configured, this module will interface with a statsd server and collect a whole bunch of statistics from your running Elgg site.

Out of the box the current version of the plugin can log:

  • Events & Hooks (which in turn give you things like user signup events and object creation)
  • Exceptions
  • PHP Errors, Notices and Warnings
  • Elgg popups (system messages and error messages)
  • Database calls
  • Script execution time

In addition you can record your own statistics by making a call to wrapper functions contained in the plugin itself.

All data will be logged into a custom “bucket”, which is by default derived from your Elgg site name. This lets you log statistics from multiple different sites to a common statsd server.

Installation is pretty straightforward once you’ve installed the base infrastructure. Follow the instructions for installing graphite, nodeJS and statsd from the various sites around the internet, and then upload the elgg-statsd plugin to your Elgg site’s mod directory.

Once activated, you can specify the statsd server you wish to log to and configure what statistics you want to record.

Have fun!

» Visit the project on Github…

P.S. If you try and set this up and are seeing errors in your graphite log along the lines of “create() takes at most 5 arguments (6 given)“, then you are likely falling foul of this bug.

My solution was to build Whisper from the latest code in master rather than the stable 0.9.x branch. This worked for me, but of course YMMV.

By default, the standard LAMP (Linux Apache Mysql Php/Perl/Python) stack doesn’t come particularly well optimised for handling more than a trivial amount of load. For most people this isn’t a problem, either they’re running on a large enough server or their traffic is at a level that they never hit against the limits.

Anyway, I’ve hit against these limits on a number of occasions now, and while there are many good articles out there on the subject, I thought I’d write down my notes. For my own sake as much as anything else…

Apache

Apache’s default configuration on most Linux distributions is not the most helpful, and you’re goal here is to do everything possible to avoid the server having to hit the swap and start thrashing.

  • MaxClients – The important one. If this is too high, apache will merrily spawn new servers to handle new requests, which is great until the server runs out of memory and dies. Rule of thumb:

    MaxClients = (Memory - other running stuff) / average size of apache process.

    If you’re serving dynamic PHP pages or pull a lot of data from databases etc the amount of memory a process takes up can quickly balloon to a very large value – sometimes as much as 15-20mb in size. Over time all running Apache processes will be the size of your largest script.

  • MaxRequestsPerChild – Setting this to a non-zero value will cause these large spawned processes to eventually die and free their memory. Generally this is a good thing, but set the value fairly high, say a few thousand.
  • KeepAliveTimeout – By default, apache keeps connections open for 15 seconds waiting for subsequent connections from the same client. This can cause processes to sit around, eating up memory and resources which could be used for incoming requests.
  • KeepAlive – If your average number of requests from different IP addresses is greater than the value of MaxClients (as it is in most typical thundering herd slashdottings), strongly consider turning this off.

Caching

  • SquidSquid Reverse Proxy sits on your server and caches requests, turning expensive dynamic pages into simple static ones, meaning that at periods of high load, requests never need to touch apache. Configuration seems complex at first, but all that is really required is to run apache on a different port (say 8080), run squid on port 80 and configure apache as a caching peer, e.g.


    http_port 80 accel defaultsite=www.mysite.com vhost
    cache_peer 127.0.0.1 parent 81 0 no-query originserver login=PASS name=myAccel

    One gotcha I found is that you have to name domains you’ll accept proxying for, otherwise you’ll get a bunch of Access Denied errors, meaning that in a vhost environment with multiple domains this can be a bit fiddly.

    A workaround is to specify an ACL with the toplevel domains specified, e.g.

    acl our_sites dstdomain .uk .com .net .org

    http_access allow our_sites
    cache_peer_access myAccel allow our_sites

  • PHP code cache – Opcode caching can boost performance by caching compiled PHP. There are a number out there, but I use xcache, purely because it was easily apt-gettable.

PHP

It goes without saying that you’d probably want to make your website code as optimal as possible, but don’t spend too much energy over this – there are lower hanging fruit, and as a rule of thumb memory and CPU is cheap when compared to developer resources.

That said, PHP is full of happy little gotchas, so…

  • Chunk output – If your script makes use of output buffering (which Elgg does, and a number of other frameworks do too), be sure that when you finally echo the buffer you do it in chunks.

    Turns out (and this bit us on the bum when building Elgg) there is a bug/feature/interaction between Apache and PHP (some internal buffer that gets burst or something) which can add multiple seconds onto a page delivery if you attempt to output large blocks of data all at once.

  • Avoid calling array_merge in a loop – When profiling Elgg some time ago I discovered that array_merge was (and I believe still is) horrifically expensive. The function does a lot of validation which in most cases isn’t necessary and calling it in a loop is ruinous. Consider using the “+” operator instead.
  • ProfileProfile your code using x-debug, find out where the bottlenecks are, you’d be surprised what is expensive and what isn’t (see the previous point).

Non-exclusive list, hope it helps!

I have recently been exploring some aspects of the Elgg scalability question by exploring how easy it would be to get the latest version of Elgg (1.6) running on a MySQL cluster.

In this article I will document the process, but first I should point out:

  • This is highly experimental and not endorsed in any way.
  • It is built against Elgg 1.6.1
  • This is not canonical and doesn’t reflect anything to do with the roadmap
  • This has not been extensively tested so caveat emptor.

Setting up the cluster

The first step is to set up the cluster on your equipment.

A MySQL cluster consists of a management node and several data nodes connected together by a network. Because I was running rather low on hardware, I cheated here and created each node as a Virtual Box image on my laptop – but the principle is the same.

Each node is an Ubuntu install (although you can use pretty much any OS) with two (virtual) network cards, one connected to the wider network (to install packages) and another on an internal network. If you do this for real you should consider removing the internet facing card once you’ve set everything up since a cluster isn’t secure enough to be run on the wider internet.

In my test configuration I had three nodes with name/internal IP as follows:

  • HHCluster1/192.168.2.1 – Management node & web server
  • HHCluster2/192.168.2.2 – First data node
  • HHCluster3/192.168.2.3 – Second data node

HHCluster1 – The management node

Install mysql, apache etc. This should be a simple matter of apt-getting the relevant packages. Clustering (ndb) support is built into the version of mysql bundled with Ubuntu, but this may not be the case universally so check!

You need to create a file in /etc/mysql/ called ndb_mgmd.cnf, this should contain the following:


[NDBD DEFAULT]
NoOfReplicas=2 # How many nodes you have
DataMemory=80M # How much memory to allocate for data storage (change for larger clusters)
IndexMemory=18M # How much memory to allocate for index storage (change for larger clusters)
[MYSQLD DEFAULT]
[NDB_MGMD DEFAULT]
[TCP DEFAULT]

[NDB_MGMD]
HostName=192.168.2.1 # IP address of this system

# Now we describe each node on the system

# First data node
HostName=192.168.2.2
DataDir=/var/lib/mysql-cluster
BackupDataDir=/var/lib/mysql-cluster/backup
DataMemory=512M
[NDBD]
# Second data node node
HostName=192.168.2.3
DataDir=/var/lib/mysql-cluster
BackupDataDir=/var/lib/mysql-cluster/backup
DataMemory=512M

#one [MYSQLD] per data storage node
[MYSQLD]
[MYSQLD]

Data nodes (HHCluster2 & 3)
You must now configure your data nodes:

  1. Create the data directories, as root type:

    mkdir -p /var/lib/mysql-cluster/backup
    chown -R mysql:mysql /var/lib/mysql-cluster

  2. Edit your /etc/mysql/my.cnf and add the following to the [mysqld] section:

    ndbcluster
    # Replace the following with the IP address of your management server
    ndb-connectstring=192.168.2.1

  3. Again in /etc/mysql/my.cnf uncomment and edit the [MYSQL_CLUSTER] section so it contains the location of your management server:

    [MYSQL_CLUSTER]
    ndb-connectstring=192.168.2.1

  4. You need to create your database on each node (this is because clustering operates on a table level rather than a database level):

    CREATE DATABASE elggcluster;

Starting the cluster

  1. Start the management node:

    /etc/init.d/mysql-ndb-mgm start

  2. Start your data nodes:

    /etc/init.d/mysql restart
    /etc/init.d/mysql-ndb restart

Verifying the cluster
You should now have the cluster up and running, you can verify this by logging into your management node and typing show in ndb_mgm.

A word on access…

The cluster is now set up and will replicate tables (created with the ndbcluster engine – more on that later), but that is only useful to a point. Right now we don’t have a single endpoint to direct queries to, so this direction needs to be done at the application level.

We could take advantage of Elgg’s built in split read and writes, but this would only allow us to use a maximum of two nodes. A better solution would be to use a load balancer here such as Ultramonkey to direct the query to the appropriate server allowing us to scale much further.

I didn’t really have time to get into this, so I am using the somewhat simpler mysql-proxy.

  1. On HHCluster1 install and run mysql-proxy:

    apt-get install mysql-proxy
    mysql-proxy --proxy-backend-addresses=192.168.2.2:3306 --proxy-backend-addresses=192.168.2.3:3306

  2. On your data nodes edit your /etc/mysql/my.cnf file. Find bind-address and change its IP to the node’s IP address. Also ensure that you have commented out any occurrence of skip-networking.
  3. Again on your client nodes, log in to mysql and grant access to your cluster table to a user on HHCluster1 – for example:

    GRANT ALL ON elggcluster.* TO root@HHCluster1.local IDENTIFIED BY '[some password]'

Installing elgg

Unfortunately as it stands, you need to make some code changes to the vanilla version of Elgg in order for it to work in a clustered environment. These changes are necessary because of the restrictions placed on us by the ndbcluster engine.

Two things in particular cause us problems – ndbcluster doesn’t support FULLTEXT indexes, and it also doesn’t support indexes over TEXT or BLOB fields.

FULLTEXT is for searching and is largely not used in the vanilla install of elgg, so I removed them. Equally, most indexes blobs one can live without, the exception being on the metastrings table.

Metastrings is accessed a lot, so the index is critical. Therefore I added an extra varchar field which we’ll modify the code to include the first 50 characters of the indexed text – this is equivalent to the existing index:

CREATE TABLE prefix_metastrings (
id int(11) NOT NULL auto_increment,
string TEXT NOT NULL,
string_index varchar(50) NOT NULL,
PRIMARY KEY (id),
KEY string_index (string_index)
) ENGINE=ndbcluster DEFAULT CHARSET=utf8;

And the modified query:

$row = get_data_row("SELECT * from {$CONFIG->dbprefix}metastrings where string=$cs'$string' and string_index='$string_index' limit 1");

Mysql’s optimiser checks the index first so this doesn’t lose a significant amount of efficiency (at least according to the explain command).

» Modified schema

The next problem is that the system log currently uses INSERT DELAYED to insert the log data. This is also not supported under the clustered engine.

There are a number of approaches we could take including using Elgg’s delayed write functionality or writing a plugin which replaces and logs to a different location.

For the purposes of this test I decided to just comment out the code in system_log().

What won’t work
Currently there are a couple of core things that won’t work under these changes, here is a by no means complete summary:

  • The system log (as previously described). This isn’t too much of a show stopper as the river code introduced in Elgg 1.5 no longer uses this.
  • The log rotate plugin as this attempts to copy the table into the archive engine type and we can’t guarantee which node it will be executed on in this scenario.
  • Any third party plugins which attempt to access the metastrings table directly (of which there should be none as direct table access is a big no no!)

Anyway, here is a patch I made against the released version of 1.6.1 with all the code changes I made. Once you have applied this patch to your Elgg install you should be able to proceed with the normal Elgg install.

Let me know any feedback you may have!

» Elgg Clustering patch for Elgg 1.6.1

Top image “Birds-eye view of the 10,240-processor SGI Altix supercomputer housed at the NASA Advanced Supercomputing facility.”