When Black Boxes Fail: Amazon, Cloud and the Need to Know

If Amazon’s Availability Zone strategy had worked as advertised its outage would have been non-news. But then again, no one really knows what was advertised…

There’s been a lot said about the Amazon outage and most of it had to do with cloud and, as details came to light, about EBS (Elastic Block Storage). But very little mention was made of what should be obvious: most customers didn’t – and still don’t - know how Availability Zones really work and, more importantly, what triggers a fail over. What’s worse, what triggers a fail back? Amazon’s documentation is light. Very light. Like cloud light.

Availability Zones are distinct locations that are designed to be insulated from failures in other zones. This allows you to protect your applications from possible failures in a single location. Availability Zones also provide inexpensive, low latency network connectivity to other Availability Zones in the same Region.

-- Announcement: New Amazon EC2 Availability Zone in US East

Now, it’s been argued that part of the reason Amazon’s outage was so impactful was a lack of meaningful communication on the part of Amazon regarding the outage. And Amazon has admitted to failing in that regard and has promised to do better in the future. But just as impactful is likely the lack of communication before the outage; before deployment. After all, the aforementioned description of Availability Zones, upon which many of the impacted customers were relying to maintain availability, does not provide enough information to understand how Availability Zones work, nor how they are architected.

TRIGGERING FAIL-OVER

Scrounging around the Internet turns up very little on how Availability Zones work, other than isolating instances from one another and being physically independent for power. In the end, what was promised was a level of isolation that would mitigate the impact of an outage in one zone.

Turns out that didn’t work out so well but more disconcerting is that there is still no explanation regarding what kind of failure – or conditions – result in a fail over from one Availability Zone to another. Global Application Delivery Services (the technology formerly known as Global Server load balancing) is a similar implementation generally found in largish organizations. Global application delivery can be configured to “fail over” in a variety of ways, based on operational or business requirements, with the definition of “failure” being anything from “OMG the data center roof fell in and crushed the whole rack!” to “Hey, this application is running a bit slower than it should, can you redirect folks on the east coast to another location? Kthanxbai.” It allows failover in the event of failure and redirection based on failure to meet operational and business goals. It’s flexible and allows the organization to define “failure” based on its needs and on its terms. But no such information appears to exist for Amazon’s Availability Zones and we are left to surmise that it’s like based on something more akin to the former rather than the latter given their rudimentary ELB (Elastic Load Balancing) capabilities.

When organizations architect and subsequently implement disaster recovery or high-availability initiatives – and make no mistake, that’s what using multiple Availability Zones on Amazon is really about – they understand the underlying mechanism and triggers that cause a “failover” from one location to another. This is the critical piece of information, of knowledge, that’s missing. In an enterprise-grade high-availability architecture it is often the case that such triggers are specified by both business and operational requirements and may include performance degradation as well as complete failure. Such triggers may be based on a percentage of available resources, or other similar resource capacity constraints. Within an Amazon Availability Zone apparently the trigger is a “failure”, but a failure of what is left to our imagination.

But also apparently missing was a critical step in any disaster recovery/high availability plan: testing. Enterprise IT not only designs and implements architectures based on reacting to a failure, but actually tests that plan. It’s often the case that such plans are tested on a yearly basis, just to ensure all the moving parts still work as expected. Relying on Amazon – or any cloud computing environment in which resources are shared – makes it very difficult to test such a plan. After all, in order to test failover from one Availability Zone to another Amazon would have to forcibly bring down an Availability Zone – and every application deployed within it. Consider how disruptive that process might be if customers started demanding such tests on their schedule. Obviously this is not conducive to Amazon maintaining its end of the uptime bargain for those customers not deployed in multiple Availability Zones.

AVAILBILITY and RELIABILITY: CRITICAL CONCERNS BORNE OUT by AMAZON FAILURE

Without the ability to test such plans, we return to the core issue – trust. Organizations relying wholly on cloud computing providers must trust the provider explicitly. And that generally means someone in the organization must understand how things work. Black boxes should not be invisible boxes, and the innate lack of visibility into the processes and architecture of cloud computing providers will eventually become as big a negative as security was once perceived to be.

Interestingly enough, a 2011 IDG report on global cloud computing adoption shows that high performance (availability and reliability) are the most important – 5% higher than security. Amazon’s epic failure will certainly do little to alleviate concerns that public cloud computing is not ready for mission critical applications.

Now, it’s been posited that customers were at fault for trusting Amazon in the first place. Others laid the blame solely on the shoulders of Amazon. Both are to blame, but Amazon gets to shoulder a higher share of that blame if for no other reason than it failed to provide the information necessary for customers to make an informed choice regarding whether or not to trust their implementation. This has been a failing of Amazon’s since it first began offering cloud computing services – it has consistently failed to offer the details necessary for customers to understand how basic processes are implemented within its environment. And with no ability to test failover across Availability Zones, organizations are at fault for trusting a technology without understanding how it works. What’s worse, many simply didn’t even care – until last week.

Now it may be the case that Amazon is more than willing to detail such information to customers; that it has adopted a “need to know” policy regarding its architecture and implementation and its definition of “failure.” If that is the case, then it behooves customers to ask before signing on the dotted line. Because customers do need to know the details to ensure they are comfortable with the level of reliability and high-availability being offered. If that’s not the case or customers are not satisfied with the answers, then it behooves them to – as has been suggested quite harshly by many cloud pundits – implement alternative plans that involve more than one provider.

A massive failure on the part of a public cloud computing provider was bound to happen eventually. If not Amazon than Azure, if not Azure then it would have been someone else. What’s important now is take stock and learn from the experience – both providers and customers – such that a similar event does not result in the same epic failure again. Two key points stand out:

The need to understand how services work when they are provided by a cloud computing provider. Black box mentality is great marketing (hey, no worries!) but in reality it’s dangerous to the health and well-being of applications deployed in such environments because you have very little visibility into what’s really going on. The failure to understand how Amazon’s Availability Zones actually worked – and exactly what constituted “isolation” aside from separate power sources as well as what constitutes a “failure” – lies strictly on the customer. Someone within the organization needs to understand how such systems work from the bottom to the top to ensure that such measures meet requirements.
The need to implement a proper disaster recovery / high availability architecture and test it. Proper disaster recovery / high availability architectures are driven by operational goals which are driven by business requirements. A requirement for 100% uptime will likely never be satisfied by a single site, regardless of provider claims.

And if you can’t test it but need to guarantee high uptime and can’t get the details necessary to understand – and trust – the implementation to provide that uptime, perhaps it’s not the right choice in the first place.

Or perhaps you just need to demand an explanation.