Amazon Outage Casts a Shadow on SDN

#SDN #AWS #Cloud Amazon’s latest outage casts a shadow on the ability of software-defined networking to respond to catastrophic failure

Much of the chatter regarding the Amazon outage has been focused on issues related to global reliability and failover and multi-region deployments. The issue of costs associated with duplicating storage and infrastructure services has been raised, and much advice given on how to avoid the negative impact of a future outage at any cloud provider.

But reading through the issues discovered during the outages caused specifically by Amazon’s control plane for EC2 and EBS one discovers a more subtle story. After reading, it seems easy to come to the conclusion that Amazon’s infrastructure is, in practice if not theory, a SDN-based network architecture. Control planes (with which customers and systems interact via its API) are separated from the actual data planes, and used to communicate constantly to assure service quality and perform more mundane operations across the entire cloud.

After power was restored, the problems with this approach to such a massive system became evident in the inability of its control plane to scale.

The duration of the recovery time for the EC2 and EBS control planes was the result of our inability to rapidly fail over to a new primary datastore.

Because the ELB control plane currently manages requests for the US East-1 Region through a shared queue, it fell increasingly behind in processing these requests; and pretty soon, these requests started taking a very long time to complete.

-- Summary of the AWS Service Event in the US East Region

This architecture is similar to the one described by SDN proponents, where control is centralized and orders are dispatched through a single controller. In the case of Amazon, that single controller is a shared queue. As we know now, this did not scale well. While recovery time duration may be tied to the excessive time it took to fail over to a new primary data store, the excruciating slowness with which services were ultimately restored to customer’s customers was almost certainly due exclusively to the inability of the control plane to scale under load.

This is not a new issue. The inability of SDN to ultimately scale in the face of very high loads has been noted by many experts who cite an inability the scale inserts into networking infrastructure via such an architecture in conjunction with inadequate response times as the primary cause of failure to scale.

Traditional load balancing services – both global and local – deal with failure through redundancy and state mirroring. ELB mimics state mirroring through the use of a shared data store, much in the same way applications share state by sharing a data store. The difference is that the traditional load balancing services are able to detect and react to failures in sub-second time, whereas a distributed, shared application-based system cannot. In fact, one instance of ELB is unlikely to be aware another has failed by design – only the controller of the overarching system is aware of such failures as it is the primary mechanism through which such failures are addressed. Traditional load balancing services are instantly aware of such failures, and enact counter-measures automatically – without being required to wait for customers to move resources from one zone to another to compensate. A traditional load balancing architecture is designed to address this failure automatically, it is one of the primary purposes for which load balancers are designed and used across the globe today.

This difference is not necessarily apparent or all that important in day to day operations when things are running smoothly. They only rise to the surface in the event of a catastrophic failure, and even then in a well-architected system they are not cause for concern, but rather relief.

One can extend the issues with this SDN-like model for load balancing to the L2-3 network services SDN is designed to serve. The same issues with shared queues and a centralized model will be exposed in the event of a catastrophic failure. Excessive requests in the shared queue (or bus) result in the inability of the control plane to adequately scale to meet the demand experienced when the entire network must “come back online” after an outage. Even if the performance of an SDN is acceptable during normal operations, its ability to restore the network after a failure may not be.

It would be unwise to ignore the issues experienced by Amazon because it does not call its ELB architecture SDN. In every sense of the term, it acts like an SDN for L4+ and this outage has exposed a potentially fatal flaw in the architecture that must be addressed moving forward.

LESSON LEARNED: SDN requires that both the control and data planes be architected for failure, and able to respond and scale instantaneously.