The majority of web servers retain a vast amount of data about their visitors in the form of log files. Other processes running on the server, like the system log, MTA log, etc, also store a raft of information.

These logs are typically retained (although often rotated at regular intervals to save space) basically until the admin is looking to reclaim some disk space or the server is reinstalled, so, from a practical standpoint that’s “forever”. This is very much part of the tech industry’s dataholic “collect everything” culture, which I’m personally trying to wean myself off of.

Thing is, at first glance, retention seems like such a good idea (and limited retention can be, more on that later). You need logs to find out how your server is performing, and what if something goes wrong? However, they’re mostly just noise, and they go stale very quickly… when was the last time you needed to look at a 4 month old apache log file?

The reality is that the vast majority of the time you’re only really interested in the last couple of lines. Why keep the rest?

What question are you trying to answer?

Log files have there use; they are invaluable to diagnose specific and immediate problems along the lines of “My web site keeps giving me a white page!”, or “Why on earth won’t my firewall start?”, or “What was the last thing Apache did before it crashed?”.

However, to answer the perhaps more useful questions like, “Am I seeing increased traffic?” or “Are my hard drives healthy?”, or even esoteric questions like, “Did spring cleaning my server save me money?“, your raw logs really aren’t going to be much use to you.

To answer the questions you’re really interested in, you’re going to have to cook this data into something tasty.

What I do…

This is the approach I’m currently using for myself, and which I been recommending to my clients. Obviously you need to adjust this based on specific requirements, for example, one client I had in the past had a legal requirement to retain all logs off line (of course nobody ever looked at them but rules is rules).

  1. Retain raw logs for a day: keep your raw logs for a short period of time, this will let you get at the raw text of any error messages should anything on your server die.
  2. Run an infrastructure monitoring tool: instead of keeping raw logs, what you should be keeping is the higher level statistical information that is produced by analysing your logs (and other sources) produced by a tool like munin. These results have all the noise (and any sensitive information) removed, and are far better at helping you diagnose problems.

Using this approach I was in the past able to, among many other things:

  • Spot a failing hard drive on a customer’s server before it became a problem (because over time the frequency of errors on that specific drive was increasing).
  • Optimise caches within a feedback loop (I could track configuration changes with a corresponding increase or decrease in cached pages served).
  • Isolate the cause of an intermittent failure on a client site (by seeing what the server was doing at the time of the outage, I could see that the mysql query cache was becoming full causing queries to run slowly and apache to block).
  • Link an increasing number of errors back to a configuration change made months ago (I had logged the time and date of the config change, and could look back at my graphs to see that I first started seeing problems after this time. Reverted the change and everything was a-ok).
  • …etc…

In each case the information was in the raw logs, but good luck trying to find it.

There are many tools out there that can help you, but the basic principle is the same – process your logs into a more usable statistical form from which it’s easy to gain insights from, and ditch the unnecessary raw logs which are mostly noise.

Anyone who has done any development with PHP will be familiar with the infamous White Screen of Death; a blank browser window indicating that something horrible has gone wrong. A common cause, for me at least, is making a method call on a null object – easy to do in an object oriented architecture.

The exact reason as to why your script has gone splat will be reported in the log file, but from a UX standpoint, giving a blank screen to your customers is far from ideal. It is particularly problematic in complicated platforms like Elgg and Known, which use output buffering and have a plugin architecture.

Here’s a quick bit of code which can catch many (all?) of these fatal errors, and at least echo something. A variation of this is already in Known. Place the code somewhere towards the start of your script…

Hope this is useful to you!