I have recently been doing a lot of development work for a very large Known installation. This installation is highly customised, has many active users, all doing unexpected and creative things with the platform, and makes use of many of Known’s more advanced features in often quite unexpected ways.

As with everything built by mankind, sometimes things go wrong, and this is especially true of something as complicated as software. Simply waiting for a user to report a fault, and for that report to bubble up through IT/management, is poor customer service. The time between a fault being found and the report being received is often measured in weeks, and the report itself is often misleading or missing crucial information, leading to more time spent clarifying the fault.

So, recently, I’ve been exploring a number of ways to handle any issues in a much more proactive way, and to collect objective data, rather than subjective fault reports.

Crash reports

One thing I’ve been exploring is a simple mechanism whereby Known will send an email to one or more addresses when a fatal error or exception occurs. This email contains the details of the error, as well as who was logged in at the time.

You can try this for yourself if you’re tracking Known’s master branch by adding the following to your config.ini:

oops_notify[] = 'you@yourdomain.com'

This is great for when something blows up, but often problems are much more subtle than that. For example, what happens if a change you’ve made causes an increase in page load for certain users? How would you track back and find out when it started, and what was the change that might have caused it?

Running stats and health metrics

In the latest master build, I’ve added a mechanism to start collecting useful metrics of a running system – page build time, events tracking, instances of errors, etc – which when properly analysed will give a much more useful overview of the general health of a running system.

I’m a big fan of graphs over logs for this sort of thing.

Currently the stats handler is a dummy which throws this information away; however, it’d be a simple matter to extend this functionality to use something like RRDTool or StatsD, with Graphite over the top to generate the graphs.
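To make the idea of a swappable stats handler concrete, here is a minimal sketch, written in JavaScript purely for illustration (Known itself is PHP). The class names, the increment/timing methods and the metric names are all assumptions, not Known's actual interface; only the "name:value|c" and "name:value|ms" line formats come from StatsD's protocol.

```javascript
// Hypothetical collector interface: increment(name) and timing(name, ms).
// These names are assumptions for illustration, not Known's actual API.

// A dummy collector accepts every metric and silently discards it.
class DummyCollector {
  increment(name) {}
  timing(name, ms) {}
}

// A collector that formats metrics as StatsD protocol lines and hands them
// to a caller-supplied send function (for example, a UDP socket write).
class StatsDLineCollector {
  constructor(send) {
    this.send = send;
  }
  increment(name) {
    this.send(`${name}:1|c`); // counter, incremented by one
  }
  timing(name, ms) {
    this.send(`${name}:${Math.round(ms)}|ms`); // timer, in milliseconds
  }
}

// Usage: collect the lines in an array instead of sending them anywhere.
const lines = [];
const stats = new StatsDLineCollector((line) => lines.push(line));
stats.increment('known.error');
stats.timing('known.page.build', 123.4);
console.log(lines);
```

Because both classes expose the same two methods, the calling code does not need to know whether its metrics are being discarded or forwarded.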

Recording your stats

If you’re a plugin writer, you can push your own statistics using the same tool, e.g.:

$stats = \Idno\Core\Idno::site()->statistics();
if (!empty($stats)) {
   $stats->increment('myplugin.somethinghappened');
}

Give it a play!

2 thoughts on “Collecting performance/health metrics in Known”

  1. Further to my last post, I’d like to introduce you to a working implementation of the stats gathering mechanism using Etsy StatsD and NodeJS.
    StatsD is a Node.JS stats server created by the people at Etsy to provide a simple way of logging useful statistics from software. These statistics are an invaluable way of monitoring the performance of your application, measuring the effect of software changes and diagnosing faults.
    This plugin gives you an overview of what is happening in your Known install by logging important system level things – events, errors, exceptions etc. This lets you get a very clear idea of how your Known network is performing, and quickly see the effect that changes have on your users.
    Installation
    Install Node.JS, either from GitHub or the package manager for your OS
    Install StatsD
    Optional, but highly recommended: install a Graphite server for graph visualisation
    Place this plugin in IdnoPlugins/StatsD
    Add the following to your config.ini:

    statistics_collector = IdnoPlugins\StatsD\StatsDStatisticsCollector;
    statsd_enabled = true;



    Optionally, you can specify one or more of the following extra options (although the defaults are usually OK):



    statsd_host = localhost
    statsd_port = 8125
    statsd_bucket = some_name
    statsd_samplerate = 1



    statsd_samplerate is useful on really busy systems (see StatsD’s notes on the subject). In a nutshell, setting it to something like 0.1 (capture one in every 10 count or timer events) helps if you find StatsD being overloaded.
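As a rough illustration of what a sample rate means on the wire, here is a minimal JavaScript sketch of client-side sampling. The function names and the bucket name are made up for this example and are not taken from the plugin; the "|@0.1" suffix is part of StatsD's line protocol and tells the server to scale the sampled count back up.

```javascript
// Decide whether this event should be sent at all. `random` is injectable
// so the behaviour can be checked deterministically.
function shouldSend(sampleRate, random = Math.random) {
  return sampleRate >= 1 || random() < sampleRate;
}

// Format a counter increment as a StatsD line. When sampling is in effect,
// the "|@rate" suffix is appended so the server can compensate.
function formatCounter(bucket, sampleRate = 1) {
  const suffix = sampleRate < 1 ? `|@${sampleRate}` : '';
  return `${bucket}:1|c${suffix}`;
}

// Returns the line to send, or null when the event is sampled out.
function sampledCounter(bucket, sampleRate, random = Math.random) {
  return shouldSend(sampleRate, random)
    ? formatCounter(bucket, sampleRate)
    : null;
}

// With a rate of 0.1, a random draw of 0.05 is sent and 0.5 is dropped.
console.log(sampledCounter('some_name.pageview', 0.1, () => 0.05));
console.log(sampledCounter('some_name.pageview', 0.1, () => 0.5));
```

The trade-off is less traffic to StatsD in exchange for counts that are estimates, not exact totals.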
    If everything is working, you should now be happily graphing some useful stats.
    » Visit the project on Github…




  2. Writing good software is hard, and often things go wrong in unexpected ways when software is deployed to live. A mantra that I’ve been trying to live by is to “never rely on your users/clients/customers reporting problems”. If anything, this should be the absolute last thing to rely on.
    I’ve previously talked about how I have been deploying metrics and fault reporting code for all my clients, regardless of what software they’re running, and how I’ve built support for these into Known core.
    These reports produce a detailed insight into the code as it runs under real usage, as well as a detailed bug log should something fail. The fault reporting messages in particular have possibly been one of the most useful things I’ve ever done, and have already been responsible for discovering a number of rare failure modes that had thus far neither shown up in testing nor been reported.
    Today I had a particularly thorny JavaScript issue to debug, and it got me thinking about how I could capture JavaScript errors from client browsers, have errors logged in a central way, and even get metric and fault reporting as I currently do with PHP errors.
    Turns out it’s actually quite easy, and in a nutshell you need to:
    Write an endpoint which will log your error message in the appropriate way, sending a fault report as necessary
    Write a bit of JavaScript code to listen for the error event and call this endpoint, e.g.:



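    window.addEventListener('error', function (error) {
        // error.error can be null (e.g. for errors in cross-origin scripts)
        var err = error.error;
        var message = err ? err.toString() : error.message;

        // Append the stack trace where the browser provides one
        if (err && err.stack)
            message += '\n' + err.stack;

        // … call your ajax endpoint…
    });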
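The two steps above could be fleshed out along the following lines. This is only a sketch under stated assumptions: the endpoint path, the JSON payload shape and the function names are hypothetical and are not what Known actually uses. The sending function is passed in as a parameter so the message-building logic can be exercised outside a browser.

```javascript
// Build a single log message from an ErrorEvent-like object. event.error
// can be null (for example, for errors in cross-origin scripts), so fall
// back to event.message in that case.
function buildErrorMessage(event) {
  const err = event.error;
  let message = err ? String(err) : String(event.message || 'Unknown error');
  if (err && err.stack) {
    message += '\n' + err.stack;
  }
  return message;
}

// Hypothetical endpoint path, chosen for this example only.
const ENDPOINT = '/my/error/logger';

// Send the message to the endpoint using the supplied post function.
function reportError(event, post) {
  const body = JSON.stringify({ message: buildErrorMessage(event) });
  return post(ENDPOINT, body);
}

// In a browser, wire the reporter to the global error event.
// navigator.sendBeacon queues a small POST without blocking the page.
if (typeof window !== 'undefined') {
  window.addEventListener('error', (event) => {
    reportError(event, (url, body) => navigator.sendBeacon(url, body));
  });
}
```

On the server side, the endpoint would then log the message and raise a fault report in whatever way suits your application.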




    The latest build of Known now has support for this: JavaScript errors will get logged in your normal logs, metrics counted, and, if you’ve enabled oops reporting, you’ll get an email when a client triggers a JavaScript error.
    Hopefully you’ll find this as useful as I have!


