Saturday 17 December 2016

HBase vs Cassandra: A pointless comparison?


It's easy to find articles comparing HBase and Cassandra and offering an opinion on which is better.  There is, not surprisingly, no widespread agreement on this. Let me declare my bias/history up front: I've had extensive experience with Cassandra and some experience with HBase.  Most recently, I have been architecting a system that will likely use HBase, but I suppose it could use Cassandra as well.

I won't make you wait until the end of this blog post to get my take on which is better.  However, you may not like my conclusion: it really depends on many things, especially on your existing architecture standards and whether you are running one of the major Hadoop distros (such as Cloudera or Hortonworks).

In short, if you are integrating with an existing Hadoop distribution, you probably want to use HBase, given that it ships as part of the major Hadoop distributions. If you are not yet convinced of Hadoop's value, and want to experiment with Bigtable-like databases and perhaps distributed frameworks like Spark, you may wish to consider Cassandra instead, because it is still (in my opinion) somewhat easier to get up and running if you haven't already installed Hadoop. You can run Hadoop on top of Cassandra if you want; however, you should probably look at buying DataStax Enterprise for this.  This may not be an option if you are a startup fueled by open source. However, DataStax at one point did have a special program for startups, which could make DataStax Enterprise a viable alternative.

Another consideration is the skill set of your developers.  Cassandra is a bit easier for some developers to wrap their minds around because of CQL, an SQL-like language for queries and updates (DML) that is now a standard part of Cassandra.  CQL also has basic support for collections (lists, sets and maps) which can be useful, as Redis has shown. I have been surprised by the difference that CQL makes to some people.  It definitely shortens the learning curve. Personally, I find the HBase put/get Java API to be quite adequate and, to be fair, it is possible to get SQL-like queries for HBase using Phoenix, Impala or Hive, although Hive queries are not quite real time (which may be fine, depending on your application). For some, Cassandra will have a slight edge here because CQL comes with a default Cassandra install.

Neither Cassandra nor HBase has ACID (Atomicity, Consistency, Isolation and Durability) transactions, which is a deal breaker for many developers and architects. Both have some form of row-level atomicity and isolation, which is sometimes enough, because one tends to de-normalize heavily when using them, and often updates that would be spread over multiple tables in a relational database are written to a single row in Cassandra and HBase. Cassandra has something called "lightweight transactions" (conditional, compare-and-set style writes), which also appear to be on the way for HBase.  These haven't been all that useful to me personally, but others may differ. There are various techniques to simulate atomic transactions that span multiple rows, should you need to do so, although I'm not aware of how you would handle isolation in this case.
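
To make the conditional-write idea a little more concrete, here is a minimal Python sketch of what a lightweight transaction gives you: a write that only succeeds if the row is in the state you expect. Nothing here talks to Cassandra or HBase. TinyRowStore and its method names are my own invention for illustration, and the single lock is a stand-in for whatever coordination the real database does.

```python
import threading


class TinyRowStore:
    """In-memory stand-in for a table with row-level atomicity (hypothetical)."""

    def __init__(self):
        self._rows = {}
        # One lock for the whole store keeps the sketch short;
        # it is an assumption, not how either database works internally.
        self._lock = threading.Lock()

    def put(self, row_key, columns):
        # Unconditional write: all columns of one row change together.
        with self._lock:
            self._rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        # Return a copy so callers cannot mutate the stored row.
        with self._lock:
            return dict(self._rows.get(row_key, {}))

    def put_if_absent(self, row_key, columns):
        # Conditional write: create the row only if it does not exist yet.
        with self._lock:
            if row_key in self._rows:
                return False
            self._rows[row_key] = dict(columns)
            return True

    def check_and_put(self, row_key, column, expected, columns):
        # Conditional write: apply the update only if `column` holds `expected`.
        with self._lock:
            current = self._rows.get(row_key, {}).get(column)
            if current != expected:
                return False
            self._rows.setdefault(row_key, {}).update(columns)
            return True


store = TinyRowStore()
created = store.put_if_absent("user:alice", {"email": "alice@example.com"})
duplicate = store.put_if_absent("user:alice", {"email": "other@example.com"})
```

In this sketch the second put_if_absent is rejected and the original row is left alone, which is the kind of "claim this username once" guarantee that a conditional write is typically used for. Note that it still only covers a single row.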

Both Cassandra and HBase are "closer to the metal" than relational databases. One absolutely needs to work backwards from the desired queries in order to design a proper data model (and knowledge of how they store data in memory and on disk can be extremely helpful). Adding a new query may force you to change your data model or, at the very least, add a new "table" to which you will need to duplicate the data that you are already writing elsewhere.  As a result, good, layered design of your code is quite necessary. You will also need to consider the layout of your data on disk and/or how often your data gets flushed to disk.  It is often very useful to add more nodes to ensure that your important data stays in memory until it is no longer frequently accessed.  Reading data and then immediately updating it is frowned upon (especially in Cassandra), so sometimes you need to work around this need.
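
Here is a rough Python sketch of what "working backwards from the queries" and duplicating writes can look like when it is hidden behind a data-access layer. It uses plain dictionaries rather than a real database, and everything in it (OrderStore, the two "tables", the order fields) is a hypothetical example of mine, not something prescribed by Cassandra or HBase.

```python
from collections import defaultdict


class OrderStore:
    """Hypothetical data-access layer hiding two query-shaped 'tables'."""

    def __init__(self):
        # Query 1: "all orders for a customer, newest first" -> keyed by customer.
        self.orders_by_customer = defaultdict(dict)
        # Query 2: "all orders for a product on a given day" -> keyed by (product, day).
        self.orders_by_product_day = defaultdict(dict)

    def record_order(self, order_id, customer, product, day, amount):
        order = {
            "order_id": order_id,
            "customer": customer,
            "product": product,
            "day": day,  # ISO date string, so string order matches date order
            "amount": amount,
        }
        # The same data is written twice, once per query it must serve.
        # There is no transaction spanning the two writes in this sketch.
        self.orders_by_customer[customer][(day, order_id)] = order
        self.orders_by_product_day[(product, day)][order_id] = order

    def orders_for_customer(self, customer):
        # Sort on the (day, order_id) key, newest day first.
        rows = self.orders_by_customer.get(customer, {})
        return [rows[key] for key in sorted(rows, reverse=True)]

    def orders_for_product_day(self, product, day):
        rows = self.orders_by_product_day.get((product, day), {})
        return [rows[key] for key in sorted(rows)]


store = OrderStore()
store.record_order("o1", "alice", "widget", "2016-12-16", 10)
store.record_order("o2", "alice", "gadget", "2016-12-17", 20)
newest_first = store.orders_for_customer("alice")
```

The point of the layering is that callers only ever see record_order and the two query methods. If a third query shows up later, the extra "table" and the extra write are added in one place instead of being scattered through the application.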

To some extent, both Cassandra and HBase are data stores that try to maximize horizontal scalability, concurrency and reliability (through enabling redundant servers which have redundant copies of data).  Relational database features (such as joins and ACID transactions) that make any of these goals hard are generally missing or their use is discouraged. Quite often, your need for these features will be the determining factor in whether you choose Cassandra or HBase as opposed to a relational database such as MySQL, PostgreSQL, Oracle, MSSQL, etc. I would argue that the results of this choice will probably have more impact than the choice between Cassandra and HBase.