So, let us talk plainly. You absolutely, definitely, positively should be using TLS / HTTPS encryption on the sites that you run.

HTTPS provides encryption, meaning that anyone watching the connection (and yes, people do care, and are absolutely watching), will have a harder time trying to extract content information about the connection. This is important, because it stops your credit card being read as it makes its way to Amazon’s servers, or your password being read when you log into your email.

When I advise my clients on infrastructure, these days I recommend that all pages on a website, regardless of that page’s contents, should be served over HTTPS. The primary reason being a feature of an encrypted connection which I don’t think gets underlined enough. I also advise them to invest in colocation to protect their business data. You can visit this page to know more.

Tamper resistant web

When you serve content over HTTPS, it is significantly harder to modify. Or, to put it another way, when you serve pages unencrypted, you have absolutely no guarantee that the page your server sends is going to be the page that your visitor receives!

If an attacker controls a link in the chain of computers between you and the site you’re visiting it is trivial to modify requests to and from a visitor and the server. A slightly more sophisticated attacker can perform these attacks without the need to control a link in the chain, a so called “Man on the side” attack – more technically complex, but still relatively trivial with sufficient budget, and has been widely automated by state actors and criminal gangs.

The capabilities these sorts of attacks give someone are limited only by budget and imagination. On a sliding scale of evil, possibly the least evil use we’ve seen in the wild is probably the advertising injection attack used by certain ISPs and Airplane/hotel wifi providers, but could easily extend to attacks designed to actively compromise your security.

Consider this example of an attack exploiting a common configuration:

  • A web application is installed on a server, and the site is available by visiting both HTTP and HTTPS endpoints. That is to say, if you visited both http://foo.com and https://foo.com, you’d get the same page being served.
  • Login details are sent using a POST form, but because the developers aren’t complete idiots they send these over HTTPS.

Seems reasonable, and I used to do this myself without seeing anything wrong with it.

However, consider what an attacker could do in this situation if the page serving the form is unencrypted. It would, for example, be a relatively trivial matter, once the infrastructure is in place, to simply rewrite “https://” to “http://”, meaning your login details would be sent unencrypted. Even if the server was configured to only accept login details on a secure connection (another fairly common practice), this attack would still work since the POST will still go ahead. A really sophisticated attacker could intercept the returning error page, and replace it with something innocuous, meaning your visitor would be non the wiser.

It gets worse of course, since as we have learnt from the Snowden disclosures, security agencies around the world will often illegally conscript unencrypted web pages to perform automated attacks on anyone they view as a target (which, as we’ve also learnt from the Snowden disclosures, includes just about everybody, including system administrators, software developers and even people who have visited CNN.com).

Lets Encrypt!

Encrypting your website is fairly straightforward, certainly when compared to the amount of work it took to deploy your web app in the first place. Plus, with the new Lets Encrypt project launching soon, it’s about to get a whole lot easier.

You’ll need to make sure you test those configurations regularly, since configuration recommendations change from time to time, and most shared hosts & default server configurations often leave a lot to be desired.

However, I assert that it is worth the extra effort.

By enabling HTTPS on the entire site, you’ll make it much much harder for an attacker to compromise your visitor’s privacy and security (I say harder, not impossible. There are still attacks that can be performed, especially if the attacker has root certificate access for certificates trusted by the computer you’re using… so be careful doing internet banking on a corporate network, or in China).

You also add to the herd immunity to your fellow internet users, normalising encrypted traffic and limiting the attack surface for mass surveillance and mass packet injection.

Finally, you’ll get a SEO advantage, since Google is now giving a ranking boost to secure pages.

So, why wait?

So yesterday, we were greeted with another bombshell from the Snowden archives.

After finding out the day before that GCHQ had spied on lawyers, we now find out that GCHQ and the NSA conspired to steal the encryption keys to pretty much every sim card in the world, meaning that they can easily break the (admittedly weak) encryption used to protect your phonecalls and text messages.

Personally, I’m not terribly concerned about this, because the idea that your mobile phone is insecure is hardly news. What is of concern to me, is how they went about getting those keys.

It seems that in order to get these keys, the intelligence agencies hunted down and placed under invasive surveillance ordinary innocent people, who just happened to be employed by or have dealings with the companies they were interested in.

The full capabilities of the global surveillance architecture they command was deployed against entirely unremarkable and innocent individuals. People like you and me, who’s entire private lives were sifted through, just in case they exposed some information that could be used against the companies which they worked.

Nothing to hide, nothing to fear

If there is a silver lining in all this, with any luck it will go some way towards shattering the idea that because you have nothing to hide, you have nothing to fear.

This is, primarily, a coping strategy. It’s a lie people tell themselves so they can avoid confronting an awkward or terrifying fact, a bit like saying climate change isn’t real, or that smoking won’t kill me.

Generally, it is taken to mean that you’ve done nothing wrong, i.e. illegal (and of course, that’s not what privacy is about, and what you consider being “wrong” has typically not been the same as what those in power consider “wrong”).

Fundamentally, it misses the point that you don’t get to decide what others are going to find interesting, or suspect you of knowing. In this instance, innocent people had their privacy invaded purely because they had suspected access to information that the intelligence agencies found interesting. This is something that, were I to do something similar, I’d go to jail for a very long time.

Now consider that one of the NSA’s core missions is to advance US economic interests, spying on Brazilian oil companies and European trade negotiations, etc. If I worked at a competitor of a US company, I’d be very careful what I said in any insecure form of communication.

You do have something to hide.

The majority of web servers retain a vast amount of data about their visitors in the form of log files. Other processes running on the server, like the system log, MTA log, etc, also store a raft of information.

These logs are typically retained (although often rotated at regular intervals to save space) basically until the admin is looking to reclaim some disk space or the server is reinstalled, so, from a practical standpoint that’s “forever”. This is very much part of the tech industry’s dataholic “collect everything” culture, which I’m personally trying to wean myself off of.

Thing is, at first glance, retention seems like such a good idea (and limited retention can be, more on that later). You need logs to find out how your server is performing, and what if something goes wrong? However, they’re mostly just noise, and they go stale very quickly… when was the last time you needed to look at a 4 month old apache log file?

The reality is that the vast majority of the time you’re only really interested in the last couple of lines. Why keep the rest?

What question are you trying to answer?

Log files have there use; they are invaluable to diagnose specific and immediate problems along the lines of “My web site keeps giving me a white page!”, or “Why on earth won’t my firewall start?”, or “What was the last thing Apache did before it crashed?”.

However, to answer the perhaps more useful questions like, “Am I seeing increased traffic?” or “Are my hard drives healthy?”, or even esoteric questions like, “Did spring cleaning my server save me money?“, your raw logs really aren’t going to be much use to you.

To answer the questions you’re really interested in, you’re going to have to cook this data into something tasty.

What I do…

This is the approach I’m currently using for myself, and which I been recommending to my clients. Obviously you need to adjust this based on specific requirements, for example, one client I had in the past had a legal requirement to retain all logs off line (of course nobody ever looked at them but rules is rules).

  1. Retain raw logs for a day: keep your raw logs for a short period of time, this will let you get at the raw text of any error messages should anything on your server die.
  2. Run an infrastructure monitoring tool: instead of keeping raw logs, what you should be keeping is the higher level statistical information that is produced by analysing your logs (and other sources) produced by a tool like munin. These results have all the noise (and any sensitive information) removed, and are far better at helping you diagnose problems.

Using this approach I was in the past able to, among many other things:

  • Spot a failing hard drive on a customer’s server before it became a problem (because over time the frequency of errors on that specific drive was increasing).
  • Optimise caches within a feedback loop (I could track configuration changes with a corresponding increase or decrease in cached pages served).
  • Isolate the cause of an intermittent failure on a client site (by seeing what the server was doing at the time of the outage, I could see that the mysql query cache was becoming full causing queries to run slowly and apache to block).
  • Link an increasing number of errors back to a configuration change made months ago (I had logged the time and date of the config change, and could look back at my graphs to see that I first started seeing problems after this time. Reverted the change and everything was a-ok).
  • …etc…

In each case the information was in the raw logs, but good luck trying to find it.

There are many tools out there that can help you, but the basic principle is the same – process your logs into a more usable statistical form from which it’s easy to gain insights from, and ditch the unnecessary raw logs which are mostly noise.