Monday, 18 February 2013

NoSQL part two: Cassandra

Having tried MongoDB, we definitely saw some promise in NoSQL (or non-relational) databases, even if Mongo wasn't quite the solution for us.  I asked our technical team to suggest other NoSQL databases that we should try.  Most of them were fairly silent, unfortunately.  However, one of our web admins thought Cassandra was something we should look at. Cassandra was at version 0.5 or 0.6 at the time (I can't remember which).

One of our senior developers/architects had a closer look and thought that it was worth doing a POC.  We decided to try using Cassandra for our election-night postal code lookup application.  This is a feature on our website in which a user types in their postal code (or a prefix of it) and the website returns the electoral district(s) for that code or prefix, possibly with live results.

In retrospect, this wasn't the absolute best use for Cassandra and we implemented it in a very simple way:  essentially we used Cassandra as a key-value store, in which the keys were postal codes or postal code prefixes and the values were lists of electoral districts. More specifically, we created a Cassandra column family (the analogue of a table in a relational database) whose row keys were postal codes or prefixes and in which each row contained a single column with a text blob listing the electoral districts.

Our somewhat naive Cassandra column family for electoral districts.
The row key is a postal code or postal code prefix.
Each row has one column with a list of electoral districts that correspond to
the postal code or postal code prefix.
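To make the naive design concrete, here is a minimal sketch of it in today's terms.  CQL didn't exist back in the 0.5/0.6 days (we went through the Thrift API), so this uses the modern DataStax Java driver, and the keyspace, table and sample data are all made up for illustration:

```java
import java.net.InetSocketAddress;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class NaivePostalLookup {
    public static void main(String[] args) {
        // Assumes a local Cassandra node and an existing "elections" keyspace.
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1") // assumption: default single-DC setup
                .build()) {

            // One row (partition) per postal code or prefix; the single value
            // column is the serialized list of electoral districts.
            session.execute("CREATE TABLE IF NOT EXISTS elections.postal_lookup ("
                    + "postal_code text PRIMARY KEY, districts text)");

            // Hypothetical data: prefix "M5V" maps to districts 10, 11 and 12.
            session.execute(
                    "INSERT INTO elections.postal_lookup (postal_code, districts) VALUES (?, ?)",
                    "M5V", "10,11,12");

            // The lookup itself is a plain key-value get.
            Row row = session.execute(
                    "SELECT districts FROM elections.postal_lookup WHERE postal_code = ?",
                    "M5V").one();
            System.out.println(row == null ? "not found" : row.getString("districts"));
        }
    }
}
```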


We probably should have put all the postal codes into columns in a single row (or one row per first letter, as in the figure below) and used column slices, which are essentially range queries on the columns in a single row (more on this in later blog entries). Since Cassandra supports 2 billion columns per row, this would have worked just fine and would have removed the need to handle postal code prefixes specially, because a prefix lookup simply becomes a range query over postal codes.  Cassandra's low-level data architecture can be a little difficult to grasp at first, and our somewhat inelegant solution was a consequence of this.

A better Cassandra column family for electoral districts (only one row and not all columns are shown).
The key idea is that there is a row for each of the 26 possible first letters of a postal code (A-Z).
Each row's column names are all the postal codes that start with that letter.
We can easily look up an individual postal code by retrieving its column, or we can
get all the columns that correspond to a prefix by doing a column slice, which is Cassandra's
term for a column range scan.
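Here is a hedged sketch of that wider layout in modern terms.  In today's CQL, the old "row key plus column names" layout maps onto a partition key plus a clustering column, and a column slice becomes a range predicate on the clustering column.  As before, the driver, keyspace and data are assumptions for illustration rather than our actual implementation:

```java
import java.net.InetSocketAddress;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class SlicedPostalLookup {
    // Turn a prefix into an exclusive upper bound by bumping its last
    // character, e.g. "M5V" -> "M5W", so the range covers every postal
    // code that starts with the prefix.
    static String upperBound(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1") // assumption, as before
                .build()) {

            // first_letter plays the role of the old row key; postal_code
            // plays the role of the column names within that row.
            session.execute("CREATE TABLE IF NOT EXISTS elections.postal_by_letter ("
                    + "first_letter text, postal_code text, districts text, "
                    + "PRIMARY KEY (first_letter, postal_code))");

            String prefix = "M5V"; // hypothetical user input
            // The column slice: one partition, a contiguous range of columns.
            for (Row row : session.execute(
                    "SELECT postal_code, districts FROM elections.postal_by_letter "
                            + "WHERE first_letter = ? AND postal_code >= ? AND postal_code < ?",
                    prefix.substring(0, 1), prefix, upperBound(prefix))) {
                System.out.println(row.getString("postal_code") + " -> "
                        + row.getString("districts"));
            }
        }
    }
}
```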


We weren't enormously impressed by Cassandra at this point; however, it did appear that it might be useful as a key-value store.  We became more impressed when we started load testing, which is extremely important for us as we tend to get very heavy traffic on election nights. When our QA team started their load test, the people watching the Cassandra servers saw almost no load on our two-node cluster and thought that something had delayed the start of the test. The QA team increased the intensity of the load test, and kept increasing it, until the load testing software itself failed.  We couldn't help but be impressed that our two-node Cassandra cluster could withstand more load than our load testing arrangement could generate.

We ran our Cassandra POC in production on election night and it performed well.  As a result, we decided to look into Cassandra in more detail and try using it for something else.  The next application was live stock quotes, a fairly classic time series problem for which our relational database had proven poorly suited.  More on that in the next blog entry.


Sunday, 10 February 2013

Our "NoSQL experience", part one: MongoDB and Lucene

A few years ago, we decided to try out some "NoSQL" databases. Up until that point, we had mostly used Oracle, along with a little MySQL. Our experience with Oracle was that it mostly worked fine, but we needed to buy bigger and more expensive servers as we built out our website and our traffic increased. MySQL was tried as a low-cost (or essentially no-cost) alternative to Oracle.  We found that it definitely worked, but it wasn't quite as fast or reliable as Oracle (this was more than five years ago).

So-called "NoSQL" databases were becoming popular and it seemed like something we should try. The most popular one at the time was MongoDB, so we made this our first choice.

We always like to try a "proof of concept" implementation before we commit to a technology.  In this case, the application was a sort of "auto complete" for our website that, given a few keystrokes, suggested a stock symbol or mutual fund from a database of about 15,000 North American stocks and about 18,000 Canadian mutual funds.  To this day, I am not really sure that this was the best application for MongoDB.  However, in our initial tests, it appeared to be fine. I'm going to dig up the actual performance figures we got and add them to this blog entry once I find them.
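For the curious, the core of such a prefix lookup is only a few lines.  This is a minimal sketch using the modern MongoDB Java driver (the driver API we used back then was quite different), and the database, collection and field names are assumptions for illustration:

```java
import java.util.regex.Pattern;

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;

public class SymbolAutoComplete {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> symbols =
                    client.getDatabase("quotes").getCollection("symbols");

            // An ascending index on the symbol field lets an anchored-prefix
            // regex query avoid scanning the whole collection.
            symbols.createIndex(Indexes.ascending("symbol"));

            String typed = "RY"; // hypothetical keystrokes from the user
            // An anchored regex means "starts with"; limit keeps the
            // suggestion list short.
            for (Document doc : symbols
                    .find(Filters.regex("symbol", "^" + Pattern.quote(typed)))
                    .limit(10)) {
                System.out.println(doc.getString("symbol") + "  " + doc.getString("name"));
            }
        }
    }
}
```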

Unfortunately, we had a lot of issues when we moved it into production on a two-node cluster.  We seemed to lose data on some of our updates, performance was very poor under real load, and the application appeared to be unstable.  We did some research and found out that Mongo wasn't really durable by default, in that writes were acknowledged before they were actually written to disk.  In retrospect, I think we also failed to do proper error checking.  At this point, we started researching our problems and discovered a lot of disconcerting articles (a modern, although somewhat biased example of these articles would be this blog entry).
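In today's terms, the durability trade-off is explicit: the driver lets you say how much acknowledgement a write requires.  A hedged sketch with the modern Java driver (the defaults at the time we ran our POC were much weaker):

```java
import org.bson.Document;

import com.mongodb.WriteConcern;
import com.mongodb.client.MongoCollection;

public class DurableWrites {
    // Ask the server to acknowledge the write only after it has been
    // committed to the on-disk journal, trading latency for durability.
    static void insertDurably(MongoCollection<Document> symbols, Document doc) {
        symbols.withWriteConcern(WriteConcern.JOURNALED).insertOne(doc);
    }
}
```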

We were faced with a production application that was failing badly, which is a nightmare scenario in which you generally look for the fastest possible solution.  In this case, it turned out that our developer team had a plan B -- an implementation of our auto-complete feature using Lucene. We did some quick performance tests and found the Lucene version to be much faster and more reliable.  So, we put it into production as fast as our promotion procedures allowed, and shut down the Mongo version.
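For flavour, here is what a Lucene-based prefix lookup can look like.  This is a minimal sketch against Lucene 9.x APIs with made-up field names; our original implementation predates these APIs, so treat it as an illustration of the idea rather than what we actually shipped:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneAutoComplete {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory index for the sketch

        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
            // StringField indexes the symbol as a single, un-analyzed token,
            // which is exactly what a prefix query wants; lowercase it so
            // lookups are case-insensitive.
            Document doc = new Document();
            doc.add(new StringField("symbol", "ry", Field.Store.YES));
            doc.add(new StringField("name", "Royal Bank of Canada", Field.Store.YES));
            writer.addDocument(doc);
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Everything whose symbol starts with what the user typed so far.
            PrefixQuery query = new PrefixQuery(new Term("symbol", "r"));
            for (ScoreDoc sd : searcher.search(query, 10).scoreDocs) {
                Document hit = searcher.doc(sd.doc);
                System.out.println(hit.get("symbol") + "  " + hit.get("name"));
            }
        }
    }
}
```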

I don't want this blog entry to descend into MongoDB bashing.  There is already a lot of that out there. To this day, I am not convinced that we chose an appropriate proof of concept application for MongoDB.  However, we did choose an excellent one for Lucene :-). What concerned us most about Mongo was that it did not seem to be durable, and we knew that global locking (a limitation which has apparently since been fixed) would be an issue for many of our applications.  We came to the collective conclusion that MongoDB simply wasn't for us.  To be fair, a number of organizations have reached the exact opposite conclusion, and, if you are considering MongoDB, you should probably go to the 10gen website and read some of the success stories.  If you are interested in a more cynical (but very funny) take, try this video.

We continue to use the Lucene solution to this day and it has worked well for us. In fact, we are now extensively using Solr (which is based on Lucene but adds convenient clustering and a web services API around it).  It turns out that Solr is sort of a "NoSQL" solution in its own right, which is something I will cover in future blog entries.

Having decided to move on from MongoDB, we started looking at other "NoSQL" solutions.  More on that in future blog entries.

Sunday, 3 February 2013

What's so ethereal about computer architecture?

What's so ethereal about computer architecture?  Perhaps nothing.  I fully admit that I chose this name because I needed a unique name that wasn't already taken on Blogger.

By way of introduction, I have been doing computer architecture professionally for about ten years now.  I've done mostly software architecture for web and SAP systems (a bizarre combination, I know), and I've also set hardware standards (a lot more controversial than you might think).

Along the way, I've seen language wars and database wars, and I've tried doing some fairly serious things with NoSQL databases (which will probably end up being a topic of discussion here).

My background is fairly hardcore computer science, including doing graduate work in the field. Lately, I've had to learn about communicating with people with non-technical backgrounds, which is something I definitely find challenging and may write about.

I definitely have some biases:  For example, I used to program professionally (in that I got paid for it) in assembly language. For me, C (and also C++) is a really high level language.  When Java came along, it seemed luxuriously high level.  I have very little issue with the "boilerplate" code that many complain of, but I have always been a big fan of functional programming. So, I suppose I am less interested in Ruby and Python and more interested in languages like Scala and Clojure as alternatives to Java and C.  I am not a big fan of the callback model of programming (I used it a lot in the, ahem, 80's), so I don't really get why Node.js is a big deal.

The company I am currently working at has been trying to use agile development methodologies (Scrum and Kanban) with varying degrees of success.  I've always been a big fan of iterative methodologies, but it is becoming clear that having good people probably trumps having the right methodology.

We've also been having some issues with managing concurrent development, which might make for some interesting blog entries.  Essentially, it's interesting how easy modern source control systems (like Git and its cousins) make branching.  I'm not convinced, though, that branching is always the best idea.  It is probably no substitute for cooperation amongst developers.

Anyway, enough for now...  It will be fun to see if anyone actually reads this :-).