Mission Impossible: Stateful Cloud Failover

The quest for truly stateful failover continues…

Lightning was the latest cause of an outage at Amazon, this time in its European zones. Lightning, like tornadoes, volcanoes, and hurricanes are often categorized as “Acts of God” and therefore beyond the sphere of control of, well, anyone other than God. Outages or damages caused by such are rarely reimbursable and it’s very hard to blame an organization for not having a “plan” to react to the loss of both primary and secondary power supplies due to intense lightning strikes. The odds of a lightning strike are pretty high in the first place – 576000 to 1 – and though the results can be disastrous, such risk is often categorized as low enough to not warrant a specific “plan” to redress.

What’s interesting about the analysis of the outage is the focus on what is, essentially, stateful failover capabilities. The Holy Grail of disaster recovery is to design a set of physically disparate systems in which the secondary system, in the event the primary fails, immediately takes over with no interruption in service or loss of data.

Yes, you read that right: it’s a zero-tolerance policy with respect to service and data loss.

And we’ve never, ever achieved it. Not that we haven’t tried, mind you, as Charles Babcock points out in his analysis of the outage:

Some companies now employ a form of disaster recovery that stores a duplicate set of virtual machines at a separate site; they're started up in the event of failure at the primary site. But Kodukula said such a process takes several minutes to get systems started at an alternative site. It also results in loss of several minutes worth of data.

Another alternative is to set up a data replication system to feed real-time data into the second site. If systems are kept running continuously, they can pick up the work of the failed systems with a minimum of data loss, he said. But companies need to employ their coordination expertise to make such a system work, and some data may still be lost.

-- Amazon Cloud Outage: What Can Be Learned? (Charles Babcock, InformationWeek, August 2011)

Disaster recovery plans are designed and implemented with the intention of minimizing loss. Practitioners are well aware that a zero-tolerance policy toward data loss for disaster recovery architectures is unrealistic. That is in part due to the “weakest link” theory that says a system’s is only as good as its weakest component’s . No application or network component can perform absolutely zero-tolerance failover, a.k.a. stateful failover, on its own. There is always the possibility that some few connections, transactions or sessions will be lost when a system fails over to a secondary system. Consider it an immutable axiom of computer science that distributed systems can never be data-level consistent. Period. If you think about it, you’ll see why we can deduce, then, that we’ll likely never see stateful failover of network devices. Hint: it’s because ultimately all state, even in network components, is stored in some form of “database” whether structured, unstructured, or table-based and distributed systems can never be data-level consistent. And if even a single connection|transaction|session is lost from a component’s table, it’s not stateful, because stateful implies zero-tolerance for loss.

WHY ZERO-TOLERANCE LOSS is IMPOSSIBLE

Now consider what we’re trying to do in a failover situation. Generally speaking we’re talking about component-level failure which, in theory and practice, is much easier than a full-scale architectural failover scenario. One device fails, the secondary takes over. As long as data has been synchronized between the two, we should theoretically be able to achieve stateful failover, right?

Except we can’t. One of the realities with respect to high availability architectures is that synchronization is not continuous and neither is heartbeat monitoring (the mechanism by which redundant pairs periodically check to ensure the primary is still active). These processes occur on a period interval as defined by operational requirements, but are generally in the 3-5 second range. Assuming a connection from a client is made at point A, and the primary component fails at point A+1 second, it is unlikely that its session data will be replicated to the secondary before point A+3 seconds, at which time the secondary determines the primary has failed and takes over operation. This “miss” results in data loss. A minute amount, most likely, but it’s still data loss.

Basically the axiom that zero-tolerance loss is impossible is a manifestation in the network infrastructure of Brewer’s CAP theorem at work which says you cannot simultaneously have Consistency, Availability and Partition Tolerance. This is evident before we even consider the slight delay that occurs on the network despite the use of gratuitous ARP to ensure a smooth(er) transition between units in the event of a failover, during which time the service may be (rightfully) perceived as unavailable. But we don’t need to complicate things any more than they already are, methinks.

What is additionally frustrating, perhaps, is that the data loss could potentially be caused by a component other than the one that fails. That is, because of the interconnected and therefore interdependent nature of the network, a sort of cascading effect can occur in the event of a failover based on the topological design of the systems. It’s the architecture, silly, that determines whether data loss will be localized to the failing components or cascade throughout the entire architecture. High availability architectures based on a parallel data path design are subject to higher data loss throughout the network than are those based on cross-connected data path designs. Certainly the latter is more complicated and harder to manage, but it’s less prone to data loss cascade throughout the infrastructure.

Now, add in the possibility that cloud-based disaster recovery systems necessarily leverage a network connection instead of a point-to-point serial connection. Network latency can lengthen the process and, if the failure is in the network connection itself, is obviously going to negatively impact the amount of data lost because synchronization cannot occur at all for that period of time when the failed primary is still “active” and the secondary realizes there’s a problem, Houston, and takes over responsibility. Now take these potential hiccups and multiply them by every redundant component in the system you are trying to assure availability for, and remember to take into consideration that to fail over to a second “site” requires not only data (as in database) replication but also state data replication across the entire infrastructure.

Clearly, this is an unpossible task. A truly stateful cloud-based failover might be able to occur if the stars were aligned just right and the chickens sacrificed and the no-fail dance performed to do so. And even then, I’d bet against it happening. The replication of state residing in infrastructure components, regardless of how necessary they may be, is almost never, ever attempted.

The reality is that we have to count on some data loss and the best strategy we can have is to minimize the loss in part by minimizing the data that must be replicated and components that must be failed over.

Or is it?