I have recently been doing a lot of development work for a very large Known installation. This installation is highly customised, has many active users, all doing unexpected and creative things with the platform, and makes use of many of Known’s more advanced features in often quite unexpected ways.

As with everything built by mankind, sometimes things go wrong, which is especially true with something as complicated as software. Simply waiting for a user to report a fault, and for that report to bubble up through IT/management, is poor customer service. The time between the fault being found, and the report being received, is often measured in weeks, and is often misleading/missing crucial information, leading to more time spent clarifying the fault.

So, recently, I’ve been exploring a number of ways to handle any issues in a much more proactive way, and to collect objective data, rather than subjective fault reports.

Crash reports

One thing I’ve been exploring is a simple mechanism whereby Known will send an email to one or more addresses when a fatal error or exception occurs. This email contains the details of the error, as well as who was logged in at the time.

You can try this for yourself if you’re tracking Known’s master branch by adding the following to your config.ini:

oops_notify[] = 'you@yourdomain.com'

This is great for when something blows up, but often problems are much more subtle than that. For example, what happens if a change you’ve made causes an increase in page load for certain users? How would you track back and find out when it started, and what was the change that might have caused it?

Running stats and health metrics

In the latest master build, I’ve added a mechanism to start collecting useful metrics of a running system – page build time, events tracking, instances of errors, etc – which when properly analysed will give a much more useful overview of the general health of a running system.

I’m a big fan of graphs over logs for this sort of thing.

Currently the stats handler is a dummy which throws this information away, however it’d be a simple matter to extend this functionality to use something RRDTool or StatsD, with Graphite over the top to generate the graphs.

Recording your stats

If you’re a plugin writer, you can push your own statistics using the same tool, e.g.:

$stats = \Idno\Core\Idno::site()->statistics();
if (!empty($stats)) {
   $stats->increment('myplugin.somethinghappened');
}

Give it a play!

StatsD, created by Etsy, is a simple Node.JS server that provides powerful way to collect numerous statistics about your web application, and to do so simply and quickly.

At Etsy, they graph everything, which they use to great effect.

Collecting statistics is cool because it gives you hard data about how your software is performing. This in turn gives you powerful analytical tools to dig deep into your code, and find out what’s really going on (especially true when combined with a visualisation tool such as Graphite). It let you see the impact of code or infrastructure changes, and to quickly identify problems. Perhaps most importantly, it is only when armed kind of hard data that you can even begin to grasp at the nettles of code optimisation, growth and scalability.

Introducing StatsD for Elgg

With this in mind, and because I needed to collect some stats for a couple of client projects, I’ve put up on Github the first version of an Elgg statsD module. When installed and configured, this module will interface with a statsd server and collect a whole bunch of statistics from your running Elgg site.

Out of the box the current version of the plugin can log:

  • Events & Hooks (which in turn give you things like user signup events and object creation)
  • Exceptions
  • PHP Errors, Notices and Warnings
  • Elgg popups (system messages and error messages)
  • Database calls
  • Script execution time

In addition you can record your own statistics by making a call to wrapper functions contained in the plugin itself.

All data will be logged into a custom “bucket”, which is by default derived from your Elgg site name. This lets you log statistics from multiple different sites to a common statsd server.

Installation is pretty straightforward once you’ve installed the base infrastructure. Follow the instructions for installing graphite, nodeJS and statsd from the various sites around the internet, and then upload the elgg-statsd plugin to your Elgg site’s mod directory.

Once activated, you can specify the statsd server you wish to log to and configure what statistics you want to record.

Have fun!

» Visit the project on Github…

P.S. If you try and set this up and are seeing errors in your graphite log along the lines of “create() takes at most 5 arguments (6 given)“, then you are likely falling foul of this bug.

My solution was to build Whisper from the latest code in master rather than the stable 0.9.x branch. This worked for me, but of course YMMV.