The majority of web servers retain a vast amount of data about their visitors in the form of log files. Other processes running on the server, like the system log, the MTA log, and so on, also store a raft of information.

These logs are typically retained (although often rotated at regular intervals to save space) until the admin is looking to reclaim some disk space or the server is reinstalled, so from a practical standpoint that’s “forever”. This is very much part of the tech industry’s dataholic “collect everything” culture, which I’m personally trying to wean myself off.

Thing is, at first glance, retention seems like such a good idea (and limited retention can be; more on that later). You need logs to find out how your server is performing, and what if something goes wrong? However, they’re mostly just noise, and they go stale very quickly… when was the last time you needed to look at a four-month-old Apache log file?

The reality is that the vast majority of the time you’re only really interested in the last couple of lines. Why keep the rest?

What question are you trying to answer?

Log files have their use; they are invaluable for diagnosing specific and immediate problems along the lines of “My web site keeps giving me a white page!”, “Why on earth won’t my firewall start?”, or “What was the last thing Apache did before it crashed?”.

However, to answer the perhaps more useful questions like, “Am I seeing increased traffic?” or “Are my hard drives healthy?”, or even esoteric questions like, “Did spring cleaning my server save me money?”, your raw logs really aren’t going to be much use to you.

To answer the questions you’re really interested in, you’re going to have to cook this data into something tasty.

What I do…

This is the approach I’m currently using for myself, and which I’ve been recommending to my clients. Obviously you need to adjust this based on specific requirements; for example, one client I had in the past had a legal requirement to retain all logs offline (of course nobody ever looked at them, but rules is rules).

  1. Retain raw logs for a day: keep your raw logs for only a short period of time; this lets you get at the raw text of any error messages should anything on your server die (see the logrotate sketch just after this list).
  2. Run an infrastructure monitoring tool: instead of keeping raw logs, what you should keep is the higher-level statistical information produced by analysing your logs (and other sources) with a tool like munin. These results have all the noise (and any sensitive information) removed, and are far better at helping you diagnose problems.
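
For the first point, a logrotate rule along the following lines is all you need. This is only a sketch; the log path is an assumption, and the exact directives are something you’ll want to adapt to your own web server and distro:

    # Rotate the web server logs daily and keep just one compressed
    # copy of yesterday's log.
    /var/log/apache2/*.log {
        daily
        rotate 1
        compress
        missingok
        notifempty
    }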

Using this approach I have, among many other things, been able to:

  • Spot a failing hard drive on a customer’s server before it became a problem (because over time the frequency of errors on that specific drive was increasing).
  • Optimise caches within a feedback loop (I could match configuration changes against a corresponding increase or decrease in cached pages served).
  • Isolate the cause of an intermittent failure on a client site (by seeing what the server was doing at the time of the outage, I could see that the mysql query cache was becoming full causing queries to run slowly and apache to block).
  • Link an increasing number of errors back to a configuration change made months ago (I had logged the time and date of the config change, and could look back at my graphs to see that I first started seeing problems after this time. Reverted the change and everything was a-ok).
  • …etc…

In each case the information was in the raw logs, but good luck trying to find it.

There are many tools out there that can help you, but the basic principle is the same – process your logs into a more usable statistical form from which it’s easy to gain insights, and ditch the unnecessary raw logs, which are mostly noise.
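
As a purely illustrative sketch of what “cooking” a log might look like, the following PHP script boils an Apache combined-format access log down to three daily numbers. The log path is an assumption, and in practice a tool like munin does this kind of job for you:

    <?php
    // Illustrative only: reduce an Apache combined access log to a few
    // daily statistics (request count, bytes served, 5xx errors).
    // The path below is an assumption; point it at your own log.
    $logFile = '/var/log/apache2/access.log';

    $stats = ['requests' => 0, 'bytes' => 0, 'errors' => 0];

    foreach (new SplFileObject($logFile) as $line) {
        // Combined format: ... "GET /path HTTP/1.1" <status> <bytes> ...
        if (preg_match('/"\S+ \S+ \S+" (\d{3}) (\d+|-)/', (string) $line, $m)) {
            $stats['requests']++;
            $stats['bytes'] += ($m[2] === '-') ? 0 : (int) $m[2];
            if ((int) $m[1] >= 500) {
                $stats['errors']++;
            }
        }
    }

    // These few numbers are what you graph and keep; the raw log they
    // came from can then be rotated away.
    echo json_encode($stats), PHP_EOL;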

2 thoughts on “Most log retention is pointless”

  1. It was most enlightening when I went to an Elasticsearch meetup at the start of the year, where a financial institution described how they were managing to store terabytes of data per day in a form that was still interrogatable, keeping a 9 petabyte log store for more involved historical storage. Who knows whether any of it was useful!
    Whilst I certainly agree that any logs over 3 months old are totally worthless, I think the biggest problem is a general lack of care and attention in logging in the first place. In fact, I’ve generally been discouraged from making any modifications or doing any logging at all in recent years, thereby reducing the problem of retention and effort all in one step!

  2. I have recently been doing a lot of development work for a very large Known installation. This installation is highly customised, has many active users, all doing unexpected and creative things with the platform, and makes use of many of Known’s more advanced features in often quite unexpected ways.
    As with everything built by mankind, sometimes things go wrong, and that is especially true of something as complicated as software. Simply waiting for a user to report a fault, and for that report to bubble up through IT/management, is poor customer service. The time between the fault being found and the report being received is often measured in weeks, and the report itself is often misleading or missing crucial information, leading to more time spent clarifying the fault.
    So, recently, I’ve been exploring a number of ways to handle any issues in a much more proactive way, and to collect objective data, rather than subjective fault reports.
    Crash reports
    One thing I’ve been exploring is a simple mechanism whereby Known will send an email to one or more addresses when a fatal error or exception occurs. This email contains the details of the error, as well as who was logged in at the time.
    You can try this for yourself if you’re tracking Known’s master branch by adding the following to your config.ini:



    oops_notify[] = 'you@yourdomain.com'


    This is great for when something blows up, but often problems are much more subtle than that. For example, what happens if a change you’ve made causes an increase in page load time for certain users? How would you track back and find out when it started, and which change might have caused it?
    Running stats and health metrics
    In the latest master build, I’ve added a mechanism to start collecting useful metrics from a running system – page build time, event tracking, instances of errors, etc. – which, when properly analysed, will give a much more useful overview of its general health.
    I’m a big fan of graphs over logs for this sort of thing.
    Currently the stats handler is a dummy which throws this information away; however, it’d be a simple matter to extend this functionality to use something like RRDTool or StatsD, with Graphite over the top to generate the graphs.
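    As a rough illustration of that kind of extension, here is a minimal sketch of a handler that forwards counter increments to a local StatsD daemon over UDP. The class and the way it would be wired into the statistics handler are assumptions rather than part of Known’s actual API; only the StatsD wire format ("name:1|c") and default port (8125) are standard:

    class StatsDCounter
    {
        private $host;
        private $port;

        public function __construct($host = '127.0.0.1', $port = 8125)
        {
            $this->host = $host;
            $this->port = $port;
        }

        // Send a StatsD counter increment ("name:1|c") as a single UDP packet
        public function increment($metric)
        {
            $socket = @fsockopen('udp://' . $this->host, $this->port);
            if ($socket) {
                fwrite($socket, $metric . ':1|c');
                fclose($socket);
            }
        }
    }

    // e.g. (new StatsDCounter())->increment('myplugin.somethinghappened');
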
    Recording your stats
    If you’re a plugin writer, you can push your own statistics using the same tool, e.g.:



    $stats = \Idno\Core\Idno::site()->statistics();
    if (!empty($stats)) {
        $stats->increment('myplugin.somethinghappened');
    }


    Give it a play!

