NoSQL is the name given to a collection of newer database storage systems, which, among other things, don’t require a database schema to be defined ahead of time. They have become increasingly popular in recent years, and a large part of the reason is that they offer a number of significant scalability advantages over traditional relational database systems, especially when deployed in modern distributed web architectures.

When Elgg was coded, all those years ago, the standard web application environment was LAMP, where the M of course meant MySQL. This was fine for the time, but things have moved on, and I have been getting an increasing number of queries from people asking me how they might go about migrating Elgg over to NoSQL, so I thought it’d be worth writing up some of my thoughts on the subject.

I caveat all of this heavily by saying that, whatever you do, migrating Elgg over to NoSQL is going to be a big job, and additionally I’ve not actually tried to do it (and I’m not likely to, unless someone persuades me). However, the following should give you a place to start…

The Object Model

The good news is that Elgg’s object model, together with its key -> value metadata system, is actually pretty well suited to NoSQL. Additionally, the fact that every entity in Elgg has a globally unique identifier, which can canonically identify an object, means that you should run into fewer issues when you come to scale.
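To make that fit a little more concrete, here is a minimal sketch of how an entity and its metadata might be folded into a single schema-less document. The field names and the build_document() helper are hypothetical, chosen for illustration; they are not Elgg’s actual schema or API.

```python
def build_document(entity_row, metadata_rows):
    """Fold an entity and its key -> value metadata into one document.

    entity_row    -- dict of core fields (hypothetical names)
    metadata_rows -- iterable of (name, value) pairs; a name may repeat
    """
    doc = dict(entity_row)
    metadata = {}
    for name, value in metadata_rows:
        # Keep every value in a list, since one name can carry several values.
        metadata.setdefault(name, []).append(value)
    doc["metadata"] = metadata
    return doc


example = build_document(
    {"guid": "1234", "type": "object", "subtype": "blog", "time_created": 1300000000},
    [("title", "Hello"), ("tags", "elgg"), ("tags", "nosql")],
)
```

Because the metadata travels with the entity, nothing about its shape has to be declared ahead of time.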

Obtaining this guid (and in fact any identifier: metadata ids, annotation ids, etc.) presents you with your first major issue.

Currently, Elgg uses MySQL’s auto_increment value in the table. This was simple, and writing a row to a table and receiving its ID is an atomic operation, meaning you don’t have to lock the table or do any other fancy stuff to ensure that the ID you receive is the correct ID for the record you’ve just written. It does however introduce a limit on how far you can scale out, since you must always have one canonical write database in order to get IDs that are unique globally throughout the system.
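The pattern looks roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for MySQL, purely so the example runs on its own; the table layout and the create_entity() helper are made up for illustration.

```python
import sqlite3

# SQLite stands in for MySQL here; AUTOINCREMENT plays the role of auto_increment.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities ("
    " guid INTEGER PRIMARY KEY AUTOINCREMENT,"
    " type TEXT NOT NULL)"
)


def create_entity(conn, entity_type):
    """Insert a row and return the ID the database assigned to it."""
    cur = conn.execute("INSERT INTO entities (type) VALUES (?)", (entity_type,))
    conn.commit()
    # The ID comes back from the same insert, so no table lock is needed.
    return cur.lastrowid


first = create_entity(conn, "object")
second = create_entity(conn, "user")
```

Every ID comes from that one table's counter, which is why all writes have to pass through a single database.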

Were I to write Elgg today, I would not have done it this way.

A starting point for addressing this issue would be to look at something like Twitter Snowflake. Snowflake is a server process which returns algorithmically generated identifiers that are unique and increase over time. How important that time ordering is in practice is up for debate, since most native operations base their sort on a separate time_created field.
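For a feel of how this kind of generator works, here is a minimal Snowflake-style sketch. It is not Twitter’s implementation: the class name, the epoch and the bit layout (41 bits of millisecond timestamp, 10 bits of worker ID, 12 bits of sequence, as commonly described for Snowflake) are assumptions for illustration.

```python
import threading
import time

EPOCH_MS = 1288834974657   # assumed custom epoch; any fixed past instant works
WORKER_BITS = 10           # up to 1024 generator processes
SEQUENCE_BITS = 12         # up to 4096 IDs per worker per millisecond
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeLikeGenerator:
    """Hypothetical generator: unique, time-ordered IDs with no shared database."""

    def __init__(self, worker_id):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    @staticmethod
    def _now_ms():
        return int(time.time() * 1000)

    def next_id(self):
        with self._lock:
            now = self._now_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )


generator = SnowflakeLikeGenerator(worker_id=1)
ids = [generator.next_id() for _ in range(5000)]
```

Each worker needs its own worker_id, but otherwise the processes never have to talk to each other. Note that IDs built this way need up to 63 bits, so they will not fit in a 32-bit integer.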

One assumption that is made quite widely throughout Elgg (and also a fair few plugins), however, is that GUIDs are integer values. There’s no getting around the fact that this is going to cause a fair amount of pain.

Objects and Functions

Once the data model has been migrated over to NoSQL, you’re going to have to modify the Elgg core database retrieval functions.

For the really low level get_entity() method, and similar functions which return individual records, this should be fairly straightforward. For the more involved get_entities* functions, you’re going to have to get a little bit more creative, especially since, as of 1.8, Elgg allows you to specify custom JOIN and WHERE clauses, so these are going to have to be remapped.
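As a rough illustration of the remapping, the sketch below translates a small subset of get_entities-style options into a filter over schema-less documents held in a plain Python list. The option names are loosely modeled on Elgg’s, and build_filter(), lookup() and find() are hypothetical helpers, not part of Elgg or of any particular NoSQL driver.

```python
def build_filter(options):
    """Turn a subset of get_entities-style options into a {dotted.path: value} filter."""
    # Raw SQL fragments have no direct equivalent, so refuse them outright.
    for sql_only in ("joins", "wheres"):
        if options.get(sql_only):
            raise NotImplementedError(f"'{sql_only}' must be remapped by hand")
    query = {}
    for field in ("type", "subtype", "owner_guid"):
        if field in options:
            query[field] = options[field]
    for name, value in options.get("metadata", {}).items():
        query[f"metadata.{name}"] = value
    return query


def lookup(doc, dotted_key):
    """Follow a dotted path through nested dicts; None if any part is missing."""
    value = doc
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def find(docs, query, limit=10, offset=0):
    """Return matching documents, newest first (sorted on time_created)."""
    def hit(doc):
        for key, wanted in query.items():
            found = lookup(doc, key)
            # A list-valued field matches if it contains the wanted value.
            if found != wanted and not (isinstance(found, list) and wanted in found):
                return False
        return True

    hits = sorted((d for d in docs if hit(d)), key=lambda d: d["time_created"], reverse=True)
    return hits[offset:offset + limit]


docs = [
    {"guid": "1", "type": "object", "subtype": "blog", "time_created": 10,
     "metadata": {"tags": ["elgg", "nosql"]}},
    {"guid": "2", "type": "object", "subtype": "blog", "time_created": 20,
     "metadata": {"tags": ["php"]}},
    {"guid": "3", "type": "user", "time_created": 30, "metadata": {}},
]
results = find(docs, build_filter({"type": "object", "metadata": {"tags": "nosql"}}))
```

The simple equality options map across cleanly; the custom clauses are where the hand work comes in, which is why the sketch refuses them rather than guessing.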

It is possible that there are some libraries or DB front end layers available to simplify this process significantly, but I’m not currently aware of any.

Plugins

Migration of plugins is going to be either really easy or really hard, depending on how they’ve been written. If they are using core Elgg functions and are not making too many assumptions, you should in theory be able to virtually drop them in and hit go (after maybe changing any places where the plugin casts GUIDs to an integer, if you’re using Snowflake).

Plugins which make their own DB queries (there shouldn’t be any, and those that are around are few in number) will obviously cause you a bit of a headache.

Anyway, those are my first thoughts on the matter. I’d be interested to hear from anybody who’s tried this!