While plenty of people have had a mouthful (or page full, or pipe full) of things to say about the Amazon outage, the one thing it brings to the fore is not a problem with cloud, but a problem with storage. Long ago, the default mechanism for “High Availability” was to have two complete copies of something (say, a network switch); when one went down, the other was brought up with the same IP. It is sad to say that even this is far and away better than the level of redundancy most of us place in our storage.
The reasons are pretty straightforward: you can put in a redundant pair of NAS heads, or a redundant pair of file/directory virtualization appliances like our own ARX, but a redundant pair of all of your storage? The cost alone would be prohibitive. Amazon’s problems seem to stem from a controller problem, not a data redundancy problem, but I’m not an insider, so that is just how it appears. Most of us suffer from the opposite: highly available entry points protecting data that is all too often a single point of failure. I lived through the sudden and irrevocable crash of an AIX NAS once, and it wasn’t pretty. When the third disk turned up bad, we were out of luck; we had to wait for priority shipment of new disks and then do a full restore, down the entire time in a business where time is money.
The critical importance of the data that is the engine of our enterprises these days makes building that cost-prohibitive, truly redundant architecture a necessity. If you don’t already have a complete replica of your data somewhere, it is worth looking into file and database replication technologies. Honestly, if you choose to keep your replicas on cheaper, slower disk, you can save a bundle and still have the security that even if your entire storage system goes down, you’ll have the ability to keep the business running.
But what I’d like to see is full-blown storage as a service. We couldn’t call it SaaS (that acronym is taken), so I’ll propose we name it Storage Tiers As A Service Host, just so we can use the acronym Staash. The worthy goal of this technology would be the ability to automatically, with no administrator interaction, redirect all traffic from device A over to device B, heterogeneously. So your core datacenter NAS goes down hard; let’s call it a power failure to one or more racks. Staash would detect that the primary is offline and substitute your secondary for it in the storage hierarchy. People might notice that files are served up more slowly, depending upon your configuration, but they’ll also still be working. Given sufficient maturity, this model could even go so far as to let them save changes made to documents that were open when the primary NAS went down, though that would be a future iteration of the concept.
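To make the idea concrete, here is a minimal Python sketch of that detect-and-substitute behavior. To be clear, this is not how any real product implements it; the class and host names (`StaashRouter`, `replica.example`) are hypothetical, and the “health check” is just a TCP connection attempt to the storage port.

```python
import socket

def reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unresolvable, etc.
        return False

class StaashRouter:
    """Toy router: serve from the primary until it fails a health
    check, then transparently substitute the replica."""

    def __init__(self, primary, secondary, port=2049):  # 2049 = NFS
        self.primary = primary
        self.secondary = secondary
        self.port = port

    def active_backend(self):
        """Pick the backend a client request should be sent to."""
        if reachable(self.primary, self.port):
            return self.primary
        # Primary is down -- substitute the secondary in the hierarchy.
        return self.secondary
```

A real implementation would of course need richer health checks than a TCP handshake, plus fencing to avoid split-brain, but the core decision is this small.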
Today we have automated redundancy all the way to the final step; it is high time we implemented redundancy on that last little bit and made our storage more agile. While I could reasonably argue that a file/directory virtualization device like F5’s ARX is the perfect place to implement this functionality – it is already heterogeneous, it sits between users and data, and it is capable of being deployed in HA pairs, all the prerequisites for Staash – I don’t think your average storage or server administrator much cares where it is implemented, as long as it is implemented.
We’re about 90% there. You can replicate your data – and you can replicate it heterogeneously. You can set up an HA pair of NAS heads (if you are a single-vendor shop) or file/directory virtualization devices whether you are single-vendor or heterogeneous. With a file/directory virtualization tool, you have already abstracted the user from the physical storage location in a very IT-friendly way (files are still saved together, storage is managed in a reasonable manner, only files with naming conflicts are renamed, etc.). All that is left is to auto-switch from your high-end primary to a replica created however your organization does these things.
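That last little bit – the auto-switch behind an already-virtualized namespace – can be sketched in a few lines of Python. This is an illustration only, not ARX’s (or anyone’s) actual design; the names (`VirtualNamespace`, `nas-a`, `nas-b`) are made up for the example.

```python
class VirtualNamespace:
    """Toy file/directory virtualization layer: clients see logical
    paths; the layer maps them to whichever physical share is healthy."""

    def __init__(self):
        # logical share name -> ordered backends (high-end primary first)
        self.shares = {}
        self.down = set()

    def add_share(self, logical, primary, replica):
        self.shares[logical] = [primary, replica]

    def mark_down(self, backend):
        """Called by health monitoring when a backend stops responding."""
        self.down.add(backend)

    def resolve(self, logical_path):
        """Map a client-visible path to a physical location,
        skipping any backend currently marked down."""
        share, _, rest = logical_path.partition("/")
        for backend in self.shares[share]:
            if backend not in self.down:
                return f"{backend}/{rest}"
        raise RuntimeError("no healthy backend for share " + share)
```

The point of the sketch is that once clients address logical paths rather than physical shares, failing over to the replica is a lookup-table change, not a client reconfiguration.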
And then you are truly redundantly designed. It’s been what, forty years? That’s almost as long as I’ve been alive.
Of course, I think this would fit in well with my long-term vision of protocol independence too, but sometimes I want to pack too much into one idea or one blog, so I’ll leave it with “let’s start implementing our storage architecture like we do our server architecture… no single point of failure.” No doubt someone out there is working on this configuration… Here’s hoping they call it Staash when it comes out.
The cat in the picture above is Jennifer Leggio’s kitty Clarabelle. Thanks to Jen for letting me use the pic!