NoSQL is the name given to a collection of newer database storage systems, which, among other things, don’t require a database schema to be defined ahead of time. They have become increasingly popular in recent years, and a large part of the reason is that they offer a number of significant scalability advantages over traditional relational database systems, especially when deployed in modern distributed web architectures.

When Elgg was coded, all those years ago, the standard web application environment was LAMP, where the M of course meant MySQL. This was fine for the time, but things have moved on, and I have been getting an increasing number of queries from people asking me how they might go about migrating Elgg over to NoSQL, so I thought it’d be worth writing up some of my thoughts on the subject.

I caveat all of this heavily by saying that, whatever you do, migrating Elgg over to NoSQL is going to be a big job, and additionally I’ve not actually tried to do it (and I’m not likely to, unless someone persuades me). However, the following should give you a place to start…

The Object Model

The good news is that Elgg’s object model, together with its key -> value metadata system, is actually pretty well suited to NoSQL. Additionally, the fact that every entity in Elgg has a globally unique identifier, which can canonically identify an object, means that you should run into fewer issues when you come to scale.
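As a rough illustration of why the model maps well, an entity and its metadata could be folded into a single document keyed by GUID. The field names below, and the plain dict standing in for a document collection, are assumptions made for this sketch; they are not Elgg’s actual schema or any particular database’s API.

```python
# Hypothetical sketch: one Elgg-style entity stored as a single document.
# Field names are illustrative assumptions, not Elgg's real schema.

def make_entity_document(guid, entity_type, subtype, owner_guid,
                         time_created, metadata):
    """Fold an entity and its key -> value metadata into one document."""
    return {
        "_id": guid,  # the GUID doubles as the document key
        "type": entity_type,
        "subtype": subtype,
        "owner_guid": owner_guid,
        "time_created": time_created,
        "metadata": dict(metadata),  # copy so the caller's dict is not shared
    }


store = {}  # stand-in for a document collection, keyed by GUID

doc = make_entity_document(
    guid=42,
    entity_type="object",
    subtype="blog",
    owner_guid=7,
    time_created=1325376000,
    metadata={"tags": ["nosql", "elgg"], "status": "published"},
)
store[doc["_id"]] = doc

# Fetching by GUID is a single key lookup, with no joins needed for metadata.
print(store[42]["metadata"]["status"])
```

Note that the metadata value for "tags" is a plain list, something a document store can hold directly.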

Obtaining this guid (and in fact any identifier – metadata ids, annotation ids etc etc) presents you with your first major issue.

Currently, Elgg uses MySQL’s auto_increment value in the table. This is simple, and writing a row and receiving its ID is an atomic operation, meaning you don’t have to lock the table or do any other fancy stuff to ensure that the ID you receive is the correct ID for the record you’ve just written. It does, however, limit how far you can scale out, since you must always have one canonical write database in order to get IDs that are globally unique throughout the system.

Were I to write Elgg today, I would not have done it this way.

A starting point for addressing this issue would be to look at using something like Twitter Snowflake. Snowflake is a server process which returns algorithmically generated identifiers that are unique and increase over time. How important that ordering is in practice is up for debate, since most native operations sort on a separate time_created field.
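To make the idea concrete, here is a minimal sketch of a Snowflake-style generator. It is loosely modelled on the commonly described layout (a millisecond timestamp, a worker id and a per-millisecond sequence packed into one 64-bit integer); the class name, bit widths and epoch value are assumptions of this sketch, not Snowflake’s actual code.

```python
import threading
import time

# Assumed layout for this sketch: 41 bits of milliseconds since a custom
# epoch, 10 bits of worker id, 12 bits of per-millisecond sequence.
EPOCH_MS = 1288834974657  # custom epoch; the value is an assumption here
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class IdGenerator:
    """Hypothetical Snowflake-style ID generator for a single worker."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.clock = clock  # injectable so the generator can be tested
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next.
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.sequence = 0
            self.last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self.sequence
            )


gen = IdGenerator(worker_id=1)
ids = [gen.next_id() for _ in range(5)]
print(ids == sorted(ids))  # IDs from one worker come out in increasing order
```

Because the worker id is baked into every value, each write node can hand out IDs without consulting a central database, which is the point of the exercise.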

One assumption that is made quite widely throughout Elgg (and also a fair few plugins), however, is that GUIDs are integer values. There’s no getting around that this is going to cause a fair amount of pain.
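One way this can bite, sketched under the assumption that the new IDs are 64-bit values and that some code path squeezes them into a 32-bit integer: the helper below emulates simple wraparound truncation. What actually happens depends on the language and platform, so treat it as an illustration rather than a description of Elgg’s behaviour.

```python
# Sketch of what can happen to a 64-bit ID when code assumes a 32-bit integer.
# The IDs below are arbitrary illustrative values, not real GUIDs.

def cast_to_int32(value):
    """Emulate wraparound truncation to a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


small_guid = 1234
big_guid = (1 << 40) + 1234  # needs more than 32 bits

# The small GUID survives the cast; the large one does not, and in this
# emulation it collapses onto the same value as the small one.
print(cast_to_int32(small_guid) == small_guid)
print(cast_to_int32(big_guid) == big_guid)
print(cast_to_int32(big_guid) == cast_to_int32(small_guid))
```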

Objects and Functions

Once the data model has been migrated over to NoSQL, you’re going to have to modify the Elgg core database retrieval functions.

For the really low-level get_entity() function, and similar functions which return individual records, this should be fairly straightforward. For the more involved get_entities* functions, you’re going to have to get a little bit more creative, especially since, as of 1.8, Elgg allows you to specify custom JOIN and WHERE clauses, which are going to have to be remapped.
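To give a flavour of the remapping, the sketch below translates a simplified, get_entities-style options array into a MongoDB-style filter document. The option names and document fields are assumptions for illustration, not Elgg’s actual API, and the function deliberately refuses raw SQL fragments rather than guessing at a translation.

```python
# Hypothetical sketch: map simplified get_entities-style options onto a
# MongoDB-style filter document. Option and field names are assumptions.

def options_to_filter(options):
    query = {}
    if "type" in options:
        query["type"] = options["type"]
    if "subtype" in options:
        query["subtype"] = options["subtype"]
    if "owner_guids" in options:
        query["owner_guid"] = {"$in": list(options["owner_guids"])}

    # Metadata name/value pairs become lookups on embedded fields.
    for name, value in options.get("metadata", {}).items():
        query["metadata." + name] = value

    created = {}
    if "created_after" in options:
        created["$gte"] = options["created_after"]
    if "created_before" in options:
        created["$lt"] = options["created_before"]
    if created:
        query["time_created"] = created

    # Raw SQL fragments have no direct equivalent and need hand remapping.
    for key in ("joins", "wheres"):
        if options.get(key):
            raise NotImplementedError(key + " must be remapped by hand")
    return query


print(options_to_filter({
    "type": "object",
    "subtype": "blog",
    "owner_guids": [7, 9],
    "metadata": {"status": "published"},
}))
```

The structured options translate mechanically; it is the free-form JOIN and WHERE clauses that need real thought.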

It is possible that there are some libraries or DB front end layers available to simplify this process significantly, but I’m not currently aware of any.

Plugins

Migration of plugins is going to be either really easy or really hard, depending on how they’ve been written. If they are using core Elgg functions and are not making too many assumptions, you should in theory be able to virtually drop them in and hit go (after maybe changing any occurrence of the plugin casting GUIDs to an integer, if you’re using Snowflake).

Plugins which make their own DB queries (there shouldn’t be any, and those that are around are few in number) will obviously cause you a bit of a headache.

Anyway, those are my first thoughts on the matter. I’d be interested to hear from anybody who’s tried this!

6 thoughts on “Running Elgg on a NoSQL database”

  1. Other benefits include the ability to (finally) include mapped arrays as metadata. It’s a good idea, and would mitigate the single biggest criticism of Elgg. But I think the team has to put all of its eggs into one NoSQL basket, rather than, eg, try and build something that will abstractly map across databases. My money is still on MongoDB, but it’ll be interesting to see what happens when it inevitably does.

  2. I agree for the most part. Elgg was designed with an abstracted db layer so people could theoretically swap out MySQL for Postgres or Oracle, as you know… but as far as I’m aware *nobody* has done this.

    How much of a problem it’s going to be to just plump for Mongo (which has a kind of query structure) or Couch (which has by far the simplest API), or any of the others, is an open question.

    It’s going to be no small task, that’s for sure, and I can’t see it being part of the current Elgg branch. It’ll either be a NoSQL fork, or an Elgg 2.0 thing – the bump in major version number meaning few if any guarantees for backwards compatibility.

  3. I don’t see this happening for Elgg 2.0. It’s an understandable criticism (“You built NoSQL on MySQL! That’s dumb!”), but we’re more concerned about other things at this point like having a sustainable AJAX story (jQuery isn’t going to cut it anymore). How widespread is mongo at this point? One of the benefits of MySQL is that you can just drop Elgg on basically any host and it will run because MySQL is everywhere.

  4. Aye 🙂

    I’m not saying that MySQL is a bad choice (and the fact that it is near ubiquitous is a strong vote in its favour), but back when Elgg was first coded it was basically the only choice (certainly the only choice any of us had any experience with), and now there are other alternatives worth considering… especially since nosql is a better fit for the object model.

    Mongo and Couch (the two I’ve played with in any seriousness) are pretty widespread – no less so than mysql – being that the former is a pecl install, and the latter is an apt-get away (on debian and ubuntu at least). Not having a schema also drops the installation process down by one step.

  5. My limited knowledge of NoSQL DBs makes me wonder how we would duplicate the kind of complex queries that are regularly done in Elgg (sure, key lookups would be straightforward…). As far as performance/scalability, I think the problem is not the traditional RDBMS but rather an architecture bound to support 5 year old APIs that let you build almost anything. There’s only so much caching you can do when nearly any code can change anything else (views have full PHP code and even get altered elsewhere by plugin hooks).

  6. It’s a different way of thinking, that’s for sure, and an invisible drop-in replacement is possibly a little bit ambitious this late in the game. Mongo has some capability to do select-like queries, and for other things there’s mapreduce, which works sort of the same.

    A lot of the time the art seems to be making sure you’re storing your data in the best structure, which for things like cassandra (which I know at least one non-trivial Elgg based system has been converted to) means de-normalising and maintaining multiple lookup tables.

    Elgg and Elgg plugins generally re-use the same type of queries over and over, and for the previously mentioned cassandra port, it seems that the vast majority of plugins require little or no modification to deal with the new back end once the core was modified. Some require a little bit more work, but I think most of that is down to time constraints meaning that only the main use-cases were catered for.

    Time will tell how this performs compared to something like an RDS back end.

    You’re totally right about the caching of course; the non-pure mvc nature of elgg means there are limits to what you can cache (although I’ve seen good performance out of both squid reverse proxying and varnish with properly tweaked settings).

    End of the day, it’s a hard problem 🙂
