NoSQL is the name given to a collection of newer database storage systems, which, among other things, don’t require a database schema to be defined ahead of time. They have become increasingly popular in recent years, and a large part of the reason is that they offer a number of significant scalability advantages over traditional relational database systems, especially when deployed in modern distributed web architectures.

When Elgg was coded, all those years ago, the standard web application environment was LAMP, where the M of course meant MySQL. This was fine for the time, but things have moved on, and I have been getting an increasing number of queries from people asking me how they might go about migrating Elgg over to NoSQL, so I thought it’d be worth writing up some of my thoughts on the subject.

I caveat all of this heavily by saying that, whatever you do, migrating Elgg over to NoSQL is going to be a big job, and additionally I’ve not actually tried to do it (and I’m not likely to, unless someone persuades me). However, the following should give you a place to start…

The Object Model

The good news is that Elgg’s object model, together with its key -> value metadata system, is actually pretty well suited to NoSQL. Additionally, the fact that every entity in Elgg has a globally unique identifier, which can canonically identify an object, means that you should run into fewer issues when you come to scale.

Obtaining this GUID (and in fact any identifier – metadata IDs, annotation IDs and so on) presents you with your first major issue.

Currently, Elgg uses MySQL’s auto_increment value on the table. This was simple, and writing a row and receiving its ID is an atomic operation, meaning you don’t have to lock the table or do any other fancy stuff to ensure that the ID you receive is the correct ID for the record you’ve just written. It does, however, introduce a limit on how far you can scale out, since you must always have one canonical write database in order to get IDs that are globally unique throughout the system.
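
To make that concrete, here is a minimal sketch of the pattern (illustrative only, not Elgg’s actual code; the credentials and column names are assumptions): the database assigns the ID as part of the write and hands it straight back.

<?php
// Illustrative sketch, not Elgg's actual code. With auto_increment the ID is
// assigned by the database as part of the INSERT, so no explicit locking is
// needed to know which ID belongs to the row we just wrote.
$db = new mysqli('localhost', 'elgg_user', 'secret', 'elgg');   // hypothetical credentials
$db->query("INSERT INTO entities (type, time_created) VALUES ('object', UNIX_TIMESTAMP())");
$guid = $db->insert_id;   // the auto_increment value for the row just inserted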

Were I writing Elgg today, I would not do it this way.

A starting point for addressing this issue would be to look at using something like Twitter Snowflake. Snowflake is a server process which returns algorithmically generated identifiers that are unique and increase over time. How important that ordering is in practice is up for debate, since most native operations sort on a separate time_created field anyway.
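
To make the idea concrete, here is a minimal sketch of a Snowflake-style generator (purely illustrative, not part of Elgg or Snowflake itself), assuming a 64-bit PHP build. The bit layout follows Twitter’s published format: 41 bits of milliseconds since a custom epoch, 10 bits of worker ID, and a 12-bit per-millisecond sequence.

<?php
// Sketch of a Snowflake-style 64-bit ID: time | worker | sequence.
// A production generator would also wait for the next millisecond when the
// sequence overflows, rather than risk handing out a duplicate.
function snowflake_id($workerId, &$lastMs, &$sequence) {
    $epoch = 1288834974657;                        // custom epoch in ms (this one is Twitter's)
    $ms = (int) floor(microtime(true) * 1000);     // current time in ms

    if ($ms === $lastMs) {
        $sequence = ($sequence + 1) & 0xFFF;       // 12-bit sequence within the same millisecond
    } else {
        $sequence = 0;
        $lastMs = $ms;
    }

    return (($ms - $epoch) << 22) | (($workerId & 0x3FF) << 12) | $sequence;
}

$lastMs = 0; $sequence = 0;
$guid = snowflake_id(1, $lastMs, $sequence);       // unique per worker, roughly time-ordered

Because the timestamp occupies the high bits, the IDs still sort roughly by creation time, which is why the loss of strict auto_increment ordering matters less than you might expect.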

One assumption that is made quite widely throughout Elgg (and also a fair few plugins), however, is that GUIDs are integer values. There’s no getting around the fact that this is going to cause a fair amount of pain.

Objects and Functions

Once the data model has been migrated over to NoSQL, you’re going to have to modify the Elgg core database retrieval functions.

For the really low-level get_entity() function, and similar functions which return individual records, this should be fairly straightforward. For the more involved get_entities*() family you’re going to have to get a little more creative, especially since, as of 1.8, Elgg allows you to specify custom JOIN and WHERE clauses, and these are going to have to be remapped.
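
As a very rough illustration of where the single-record case ends up, here is what a GUID-keyed fetch might look like against a document store (MongoDB via the legacy PHP driver); the database, collection and field names are assumptions made for the sketch, not Elgg’s actual schema.

<?php
// Rough illustration only: a GUID-keyed entity lookup in a document store.
// The 'elgg' database, 'entities' collection and 'guid' field are assumptions.
function nosql_get_entity_row($guid) {
    $mongo = new MongoClient();                    // legacy PHP Mongo driver, localhost by default
    $entities = $mongo->selectDB('elgg')->selectCollection('entities');

    // One document per entity, keyed on its GUID, so SELECT ... WHERE guid = ?
    // becomes a single-key lookup.
    return $entities->findOne(array('guid' => $guid));
}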

It is possible that there are some libraries or DB front end layers available to simplify this process significantly, but I’m not currently aware of any.

Plugins

Migration of plugins is going to be either really easy or really hard, depending on how they’ve been written. If they are using core Elgg functions and are not making too many assumptions, you should in theory be able to virtually drop them in and hit go (after perhaps changing any place where the plugin casts GUIDs to an integer, if you’re using Snowflake).

Plugins which make their own DB queries (there shouldn’t be any, but a few are around nonetheless) will obviously cause you a bit of a headache.

Anyway, those are my first thoughts on the matter. I’d be interested to hear from anybody who’s tried this!

If you run a web server and take a look at the logs, you will likely have seen something like this appearing:

xxx.xxx.xxx.xxx - - [01/May/2013:18:32:36 +0100] "GET /w00tw00t.at.ISC.SANS.DFind:) HTTP/1.1" 400 320 "-" "-"

This is the product of a tool called w00tw00t, which is used by nefarious script kiddies to probe and attempt to compromise your server. If your server is up to date, there is probably not too much to worry about (without wanting to jinx it). However, since a defence-in-depth approach is always the best plan when it comes to security, it is probably a good idea to deploy some additional countermeasures.

A common tactic is to use a tool like fail2ban to monitor your logs and then firewall off the offender’s IP address, and there are filters out there to do this.

However, like a lot of people, I use the caching proxy Squid, in reverse proxy mode, to help handle high load on a web server. Since, in this configuration, Apache (and therefore fail2ban’s standard w00tw00t rules) will see these requests as coming from the cache machine, we need to take another approach.

One option is to modify the Apache log format to use the X-Forwarded-For header instead (details of how can be found here), thus preserving the original IP address in the logs. However, this would have required me to modify a number of vhosts, and it seemed simpler to monitor the one Squid access log.
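
For reference, that change is roughly a LogFormat along these lines in each vhost (the format name and log path below are just placeholders):

LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" proxy-combined
CustomLog /var/log/apache2/access.log proxy-combined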

So, I wrote a quick fail2ban filter to catch w00tw00t scans and block the offending IP address.

In jail.local

[squid-w00tw00t]
enabled = true
filter = squid-w00tw00t
port = all
logpath = /var/log/squid/access.log
maxretry = 1

In filter.d/squid-w00tw00t.conf

# Fail2Ban configuration file to catch w00tw00t scans on squid reverse proxy settings
#
# Author: Marcus Povey
#

[INCLUDES]

# Read common prefixes. If any customizations available -- read them from
# common.local
before = common.conf

[Definition]

_daemon = squid

# Option: failregex
# Notes.: regex to match w00tw00t scan requests in the logfile. The
# host must be matched by a group named "host". The tag "<HOST>" can
# be used for standard IP/hostname matching and is only an alias for
# (?:::f{4,6}:)?(?P<host>[\w\-.^_]+)
# Values: TEXT
#
failregex = <HOST> TCP_.*http.*/(w00tw00t|wootwoot|WootWoot|WooTWooT).*$

# Option: ignoreregex
# Notes.: regex to ignore. If this regex matches, the line is ignored.
# Values: TEXT
#
ignoreregex =
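
Before enabling the jail, it's worth sanity-checking the filter against your actual log with fail2ban-regex (paths assume a Debian-style layout):

fail2ban-regex /var/log/squid/access.log /etc/fail2ban/filter.d/squid-w00tw00t.conf

If the regex is working you should see a match reported for each w00tw00t probe in the log; restart fail2ban and the new jail will start banning on the first hit (maxretry = 1).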

» Visit the project on Github…

DNS is the system which converts a human readable address, like www.google.com, into the IP address that the computer actually uses to route your connection through the internet, e.g. 173.194.34.179.

This works very well; however, it is a cleartext protocol. So, even if all other traffic from your computer is encrypted (for example, by routing your outbound traffic through a VPN – more on that later), you may still be “leaking” your browsing activity to others on your network.

Since I intend to do my best to stamp out cleartext wherever it may be, this is a problem for me.

Encrypting DNS

Unfortunately, DNS is still very much a legacy technology as far as modern security practices are concerned, and does not natively support encryption. Fortunately, OpenDNS, a distributed DNS alternative, have provided DNSCrypt, which is open source and will encrypt DNS traffic between your computer and their servers.

DNSCrypt will help protect your browsing from being snooped on; however, you should be aware it’s not foolproof. While people on the same WiFi hotspot (or your ISP) will not be able to see the clear text of the DNS resolution flash by, once it’s resolved into an IP address they will still see the outbound connection. So, while they won’t see www.google.com in their capture logs, they will still see that you made a connection to 173.194.34.179, which an attacker can resolve back into www.google.com if they have the motivation. To protect against this, you must deploy this technology alongside a VPN of some sort, which will encrypt the whole communication, at least until the VPN outputs onto the internet proper.

All that said, I’ve got it turned on on my home network (since there’s no sense in making an attacker’s life easy), and I’ve got it running on my laptop to give me extra protection against snooping while surfing on public wifi, and in the case of my laptop, I also surf over a VPN.

Setting it up

By far the easiest thing to do is to use dnscrypt-proxy, which serves as a drop-in replacement for the normal DNS server provided by your ISP. Run it on your local machine, configure your network settings to talk to 127.0.0.1, and you’re done.

In my home network, I had an additional complication, in that I run my own DNS server, which provides human readable names for the computers and devices around the home (my computers, the NAS, the printers and so on). I wanted to preserve these, and then configure the DNS server to relay anything that wasn’t local (or cached) via the encrypted link. To accomplish this, I needed to run dnscrypt-proxy on the network DNS machine alongside BIND (the traditional DNS server software), but listening on a different port.

dnscrypt-proxy --local-address=127.0.0.1:5553 --daemonize
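
Before pointing anything at it, you can check the proxy is answering on its new port with a quick dig (assuming dnsutils is installed):

dig @127.0.0.1 -p 5553 www.google.com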

Over on github, my fork of the project contains a Debian /etc/init.d startup script which starts the proxy up in this configuration. You may find this useful.

Then, all I’d need to do is configure BIND to use the DNS proxy as a forwarder, and I should be done.

In /etc/bind/named.conf.options:

forwarders {
    127.0.0.1 port 5553;
};

You can use pretty much any port that you like, but don’t be tempted to use something obvious like 5353, since this will cause problems with any Avahi/Bonjour services you may have running.

You may also want to put a blank forwarders section in the zone declaration for your local domain (which is strictly speaking “correct”, but which many examples omit), e.g.:

zone "example.local" {
    type master;
    notify no;
    file "/etc/bind/db.example.local";
    forwarders { };
};
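
After reloading BIND (rndc reload on most setups), a quick test from a machine on the LAN should show local names still being served from your own zones while everything else is forwarded over the encrypted link. The 192.168.1.1 address and nas.example.local name below are just placeholders for your own DNS server and a local host:

rndc reload
dig @192.168.1.1 nas.example.local     # answered locally by BIND
dig @192.168.1.1 www.google.com        # forwarded via dnscrypt-proxy to OpenDNS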

Some gotchas

First, OpenDNS by default provide “helpful” content filtering, typo correction and a search page for unrecognised domains. The last of these means that any bad domain will resolve to their web servers on 67.215.65.132, which can break your resolv.conf search domain. This can cause problems if, for example, you have subdomains or wildcards in the zone file for your local domain, since it will make them accessible only by their fully qualified domain names.

A workaround for this is to create a free account on OpenDNS, register your network, and then disable their web content filtering and typo correction, although my feeling is that I may have made a mistake in the configuration.

Second, OpenDNS’ servers do not support DNSSEC, despite promises to the contrary. I’m not sure why; probably because it would break the DNS hijacking which makes the unrecognised domain redirection above possible. Since their business is security, OpenDNS should be doing DNSSEC validation on your behalf, so how much of an issue this is remains an open question.

Still, it’s worth noting, since you will at least see a lot of “error (broken trust chain) resolving ...” messages in your system log, and in all probability your connection will stop working when forwarding upstream.

Happy encrypting!

Update: CloudNS, an Australian-based name service, now offer DNSCrypt together with no logging. There are also a number of OpenNIC servers which are starting to support DNS encryption, so it’s worth keeping an eye on the Tier2 server page.