If Amazon’s Availability Zone strategy had worked as advertised its outage would have been non-news. But then again, no one really knows what was advertised…
Availability Zones are distinct locations that are designed to be insulated from failures in other zones. This allows you to protect your applications from possible failures in a single location. Availability Zones also provide inexpensive, low latency network connectivity to other Availability Zones in the same Region.
Now, it’s been argued that part of the reason Amazon’s outage was so impactful was a lack of meaningful communicati... on the part of Amazon regarding the outage. And Amazon has admitted to failing in that regard and has promised to do better in the future. But just as impactful is likely the lack of communication before the outage; before deployment. After all, the aforementioned description of Availability Zones, upon which many of the impacted customers were relying to maintain availability, does not provide enough information to understand how Availability Zones work, nor how they are architected.
Turns out that didn’t work out so well but more disconcerting is that there is still no explanation regarding what kind of failure – or conditions – result in a fail over from one Availability Zone to another. Global Application Delivery Services (the technology formerly known as Global Server load balancing) is a similar implementation generally found in largish organizations. Global application delivery can be configured to “fail over” in a variety of ways, based on operational or business requirements, with the definition of “failure” being anything from “OMG the data center roof fell in and crushed the whole rack!” to “Hey, this application is running a bit slower than it should, can you redirect folks on the east coast to another location? Kthanxbai.” It allows failover in the event of failure and redirection based on failure to meet operational and business goals. It’s flexible and allows the organization to define “failure” based on its needs and on its terms. But no such information appears to exist for Amazon’s Availability Zones and we are left to surmise that it’s like based on something more akin to the former rather than the latter given their rudimentary ELB (Elastic Load Balancing) capabilities.
When organizations architect and subsequently implement disaster recovery or high-availability initiatives – and make no mistake, that’s what using multiple Availability Zones on Amazon is really about – they understand the underlying mechanism and triggers that cause a “failover” from one location to another. This is the critical piece of information, of knowledge, that’s missing. In an enterprise-grade high-availability architecture it is often the case that such triggers are specified by both business and operational requirements and may include performance degradation as well as complete failure. Such triggers may be based on a percentage of available resources, or other similar resource capacity constraints. Within an Amazon Availability Zone apparently the trigger is a “failure”, but a failure of what is left to our imagination.
But also apparently missing was a critical step in any disaster recovery/high availability plan: testing. Enterprise IT not only designs and implements architectures based on reacting to a failure, but actually tests that plan. It’s often the case that such plans are tested on a yearly basis, just to ensure all the moving parts still work as expected. Relying on Amazon – or any cloud computing environment in which resources are shared – makes it very difficult to test such a plan. After all, in order to test failover from one Availability Zone to another Amazon would have to forcibly bring down an Availability Zone – and every application deployed within it. Consider how disruptive that process might be if customers started demanding such tests on their schedule. Obviously this is not conducive to Amazon maintaining its end of the uptime bargain for those customers not deployed in multiple Availability Zones.
Interestingly enough, a 2011 IDG report on global cloud computing adoption shows that high performance (availability and reliability) are the most important – 5% higher than security. Amazon’s epic failure will certainly do little to alleviate concerns that public cloud computing is not ready for mission critical applications.
Now, it’s been posited that customers were at fault for trusting Amazon in the first place. Others laid the blame solely on the shoulders of Amazon. Both are to blame, but Amazon gets to shoulder a higher share of that blame if for no other reason than it failed to provide the information necessary for customers to make an informed choice regarding whether or not to trust their implementation. This has been a failing of Amazon’s since it first began offering cloud computing services – it has consistently failed to offer the details necessary for customers to understand how basic processes are implemented within its environment. And with no ability to test failover across Availability Zones, organizations are at fault for trusting a technology without understanding how it works. What’s worse, many simply didn’t even care – until last week.
Now it may be the case that Amazon is more than willing to detail such information to customers; that it has adopted a “need to know” policy regarding its architecture and implementation and its definition of “failure.” If that is the case, then it behooves customers to ask before signing on the dotted line. Because customers do need to know the details to ensure they are comfortable with the level of reliability and high-availability being offered. If that’s not the case or customers are not satisfied with the answers, then it behooves them to – as has been suggested quite harshly by many cloud pundits – implement alternative plans that involve more than one provider.
A massive failure on the part of a public cloud computing provider was bound to happen eventually. If not Amazon than Azure, if not Azure then it would have been someone else. What’s important now is take stock and learn from the experience – both providers and customers – such that a similar event does not result in the same epic failure again. Two key points stand out:
And if you can’t test it but need to guarantee high uptime and can’t get the details necessary to understand – and trust – the implementation to provide that uptime, perhaps it’s not the right choice in the first place.
Or perhaps you just need to demand an explanation.