Sunday 3 March 2013

Make sure your SAN/storage is redundant!

The company I work at learned a very hard lesson this week.  We had a large number of systems down for more than 24 hours because our main SAN failed.  I am not going to name the manufacturer, because that misses the point.  Even though the SAN was expensive, highly redundant and normally extremely reliable, it failed and took over 24 hours to fix (yes, we have a really good support plan from the vendor, otherwise we probably would have been down even longer).  Many of our databases relied on that SAN, as did some of our NFS volumes. I'm not sure that everyone appreciated how dependent we were on that SAN, and there may not have been much anxiety over it because it had worked flawlessly for many years.

This may be obvious to some (and it is to us in retrospect), but even if you have redundant databases, they are not truly redundant if they are on the same SAN. Some very large database manufacturers advertise clustering products that assume that all the nodes are hooked to the same storage.  Obviously, this may not get you the fault tolerance you desire if your SAN decides to fail.  When I did my Master's degree, my academic supervisor once gently chastised me for saying that a certain type of computer systems failure was very infrequent and shouldn't be worried about.  His point was that infrequent/improbable failures are the worst kind because no-one is generally prepared to deal with them.

The problem with trying to design fully redundant systems is that it is easy to miss single points of failure like SANs. Architects don't always concern themselves with where the storage is situated (certainly developers almost never do, though perhaps a devops team might), and operations teams sometimes need to move storage around to balance the load over several SANs. Those operations teams may not be aware that server 1 and server 2 host redundant copies of the same database and therefore should not be sharing a SAN. Even more insidious are NFS/network volumes.  If major parts of your infrastructure have NFS/network volumes mounted from the same server or SAN, then you could be in for a nasty surprise.
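To make this kind of hidden dependency visible, something like the following rough Python sketch could be run against an inventory. It is only an illustration: the function names, the shape of the inventory dictionaries and the server/SAN names are all made up for this example, and I am assuming you can get a server-to-SAN mapping from somewhere (a CMDB, a spreadsheet, whatever you have). The second function assumes a Linux-style mount table (the format of /proc/mounts) and counts how many NFS mounts come from each file server.

```python
from collections import Counter, defaultdict


def replicas_sharing_storage(replica_groups, storage_of):
    """Find replica groups where more than one replica sits on the same SAN.

    replica_groups: {group name: [server names]}   (hypothetical inventory)
    storage_of:     {server name: SAN name, or None for local disk}
    Servers missing from storage_of are treated as "unknown" storage, so
    two unmapped replicas are flagged rather than silently passed.
    """
    problems = {}
    for group, servers in replica_groups.items():
        by_san = defaultdict(list)
        for server in servers:
            san = storage_of.get(server, "unknown")
            if san is None:
                continue  # local disk: not shared with any other server
            by_san[san].append(server)
        shared = {san: hosts for san, hosts in by_san.items() if len(hosts) > 1}
        if shared:
            problems[group] = shared
    return problems


def nfs_mounts_per_server(mounts_text):
    """Count NFS mounts per file server from /proc/mounts-style text."""
    counts = Counter()
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, filesystem type, options, ...
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            server, sep, _export = fields[0].partition(":/")
            if sep:
                counts[server] += 1
    return counts


if __name__ == "__main__":
    # Made-up example data, purely for illustration.
    groups = {"orders-db": ["db1", "db2"], "search": ["solr1", "solr2"]}
    storage = {"db1": "san-a", "db2": "san-a", "solr1": None, "solr2": None}
    print(replicas_sharing_storage(groups, storage))

    with open("/proc/mounts") as f:  # Linux only
        print(nfs_mounts_per_server(f.read()))
```

A check like this is only as good as the inventory behind it, and it would need to be re-run whenever operations moves storage around, which is exactly when these single points of failure tend to creep in.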

You have to wonder if you should always have at least two data centres that are widely separated geographically.  The advantage of this is that it encourages the systems in the data centres to be mostly independent of each other (although I guess it doesn't necessarily guarantee it).  Perhaps having a redundant set of systems in the cloud is another way to do this for those who can't afford another data centre.  I'm not sure that putting all your systems in the (same) cloud is necessarily going to be a perfect solution either though.  A cloud tends to be one large software/hardware system, and a single bug can take out a large proportion of the systems hosted in it.  I guess, if you want to go the "everything in the cloud" route, it may make sense to use clouds from at least two different providers.

If you are storing data for websites, NoSQL data stores like Cassandra and search platforms like Solr are attractive, because both can be clustered to avoid downtime due to node failures and both can run entirely off local disk. We have Cassandra and Solr clusters that ran just fine through our SAN crisis, and hopefully I can convince our development teams to use them more extensively in the future.
