Monday 18 February 2013

NoSQL part two: Cassandra

Having tried MongoDB, we definitely saw some promise in NoSQL (or non-relational) databases, even if Mongo wasn't quite the solution for us. I asked our technical team to suggest other NoSQL databases that we should try. Most of them were fairly silent, unfortunately. However, one of our web admins thought Cassandra was something we should look at. Cassandra was at either version 0.5 or 0.6 at the time (I can't remember which).

One of our senior developers/architects had a closer look and thought that it was worth doing a proof of concept (POC). We decided to try using Cassandra for our election night postal code lookup application. This is a feature on our website in which a user types in their postal code (or a prefix of it) and the website returns the electoral district(s) for the postal code or prefix, possibly with live results.

In retrospect, this wasn't the absolute best use for Cassandra and we implemented it in a very simple way: essentially we used Cassandra as a key-value store, in which the keys were postal codes or postal code prefixes and the values were electoral district numbers. More specifically, we created a Cassandra column family (which is the analogue of a table in a relational database) whose row keys were postal codes or prefixes and in which each row contained a single column with a text blob that had a list of electoral districts.

Our somewhat naive Cassandra column family for electoral districts.
The row key is a postal code or postal code prefix.
Each row has one column with a list of electoral districts that correspond to
the postal code or postal code prefix.
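
To make the key-value layout concrete, here is a minimal Python sketch that models the column family as a plain dictionary instead of talking to a real Cassandra cluster. The function names, the sample postal codes and the district numbers are all made up for illustration, and writing a row for every possible prefix length is only an assumption about how the prefix rows might have been precomputed.

```python
def build_rows(code_to_district):
    """Precompute one row per full postal code and one per prefix.

    Each row value is a comma-separated text blob of district numbers,
    standing in for the single column described above.
    """
    rows = {}
    for code, district in code_to_district.items():
        # Hypothetical choice: store every prefix length, from 1 character
        # up to the full postal code.
        for end in range(1, len(code) + 1):
            rows.setdefault(code[:end], set()).add(district)
    return {key: ",".join(sorted(districts)) for key, districts in rows.items()}


def lookup_districts(rows, postal_code_or_prefix):
    """Single key lookup: normalise the input, fetch the blob, split it."""
    key = postal_code_or_prefix.replace(" ", "").upper()
    blob = rows.get(key)
    if blob is None:
        return []
    return blob.split(",")


# Invented sample data, not real district assignments.
sample_codes = {"K1A0B1": "35075", "K1A0C2": "35076", "M5V2T6": "35110"}
sample_rows = build_rows(sample_codes)
print(lookup_districts(sample_rows, "K1A"))
```

Every read is a single key lookup, which is simple and fast, but every prefix has to be written as its own row ahead of time.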


We probably should have put all the postal codes into columns in a single row and used column slices, which are essentially range queries on the columns in a single row (more on this in later blog entries). Since Cassandra supports up to 2 billion columns per row, this would have worked just fine and would have saved the need to handle postal code prefixes specially, because we could handle a prefix by searching for a range of postal codes. The Cassandra low-level data architecture can be a little difficult to grasp at first, and choosing a somewhat inelegant solution was a consequence of this.

A better Cassandra column family for electoral districts (only one row is shown, and not all of its columns).
The key idea is that there is a row for each of the 26 first letters of a postal code (A-Z).
Each row has column names that are all the postal codes that start with that letter.
We can easily look up an individual postal code by retrieving its column, or we can
get all the columns that correspond to a prefix by doing a column slice, which is Cassandra's
term for a column range scan.
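
The column slice idea can be sketched in Python without a Cassandra cluster by keeping each row's column names sorted and doing a range scan over them. The `WideRow` class and `districts_for_prefix` function are hypothetical names invented for this sketch, the sample data is made up, and the `"\uffff"` upper bound assumes postal codes only contain ordinary letters and digits.

```python
import bisect


class WideRow:
    """Stand-in for one wide row: column names kept in sorted order."""

    def __init__(self, columns):
        self._names = sorted(columns)
        self._values = dict(columns)

    def get(self, name):
        """Fetch a single column by name (an exact postal code lookup)."""
        return self._values.get(name)

    def slice(self, start, finish):
        """Return (name, value) pairs with start <= name <= finish."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(name, self._values[name]) for name in self._names[lo:hi]]


def districts_for_prefix(rows, prefix):
    """Find districts for a postal code or prefix with one column slice."""
    prefix = prefix.replace(" ", "").upper()
    if not prefix:
        return []
    # The row key is the first letter of the postal code.
    row = rows.get(prefix[0])
    if row is None:
        return []
    # Assumption: no postal code character sorts after "\uffff", so this
    # range covers every column name that starts with the prefix.
    columns = row.slice(prefix, prefix + "\uffff")
    return sorted({district for _, district in columns})


# Invented sample data, not real district assignments.
sample_rows = {
    "K": WideRow({"K1A0B1": "35075", "K1A0C2": "35076", "K2B1A1": "35080"}),
    "M": WideRow({"M5V2T6": "35110"}),
}
print(districts_for_prefix(sample_rows, "K1A"))
```

With this layout a full postal code and a prefix go through the same code path, so no prefix rows need to be precomputed.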


We weren't enormously impressed by Cassandra at this point, although it did appear it might be useful as a key-value store. We became more impressed when we started doing load testing, which is extremely important as we tend to get very heavy traffic on election nights. When our QA team started doing their load test, the people watching the Cassandra servers saw almost no load on our two-node Cassandra cluster and thought that something had delayed the start of the test. The QA team subsequently increased the intensity of their load test and continued to increase it, until the load testing software failed. We couldn't help but be impressed by the fact that our two-node Cassandra cluster could withstand more load than our load testing arrangement could generate.

We ran our Cassandra POC in production on election night and it performed well. As a result, we decided to look into Cassandra in more detail and try using it for something else. The next application was live stock quotes, a fairly classic time series problem for which we had shown our relational database to be poorly suited. More on that in the next blog entry.

