elb
3 Topics

Amazon Outage Casts a Shadow on SDN
#SDN #AWS #Cloud

Amazon’s latest outage casts a shadow on the ability of software-defined networking to respond to catastrophic failure

Much of the chatter regarding the Amazon outage has focused on global reliability, failover, and multi-region deployments. The cost of duplicating storage and infrastructure services has been raised, and much advice has been given on how to avoid the negative impact of a future outage at any cloud provider. But reading through the issues discovered during the outage, specifically those caused by Amazon’s control plane for EC2 and EBS, one discovers a more subtle story.

After reading, it seems easy to conclude that Amazon’s infrastructure is, in practice if not in theory, an SDN-based network architecture. Control planes (with which customers and systems interact via an API) are separated from the actual data planes, and are used to communicate constantly to assure service quality and perform more mundane operations across the entire cloud.

After power was restored, the problems with this approach to such a massive system became evident in the inability of its control plane to scale.

The duration of the recovery time for the EC2 and EBS control planes was the result of our inability to rapidly fail over to a new primary datastore. Because the ELB control plane currently manages requests for the US East-1 Region through a shared queue, it fell increasingly behind in processing these requests; and pretty soon, these requests started taking a very long time to complete.
-- Summary of the AWS Service Event in the US East Region

This architecture is similar to the one described by SDN proponents, where control is centralized and orders are dispatched through a single controller. In the case of Amazon, that single controller is a shared queue. As we know now, this did not scale well.
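To make the "fell increasingly behind" behavior concrete, here is a minimal sketch of a single shared queue served at a fixed rate. The function names and every number in it are hypothetical and chosen only for illustration; nothing here reflects Amazon’s actual implementation or request volumes.

```python
def simulate_shared_queue(arrivals_per_tick, service_rate):
    """Return the backlog after each tick for one shared work queue.

    arrivals_per_tick: control-plane requests arriving in each tick
    service_rate: requests the single controller can complete per tick
    """
    backlog = 0
    history = []
    for arrivals in arrivals_per_tick:
        backlog += arrivals
        backlog -= min(backlog, service_rate)  # serve as many as capacity allows
        history.append(backlog)
    return history


def ticks_to_drain(backlog, service_rate):
    """Ticks needed to clear a backlog once no new requests arrive."""
    return -(-backlog // service_rate)  # ceiling division


# Hypothetical load: 80 requests/tick in normal operation, then a recovery
# surge of 500 requests/tick for five ticks, against a capacity of 100/tick.
normal = simulate_shared_queue([80] * 5, service_rate=100)
surge = simulate_shared_queue([500] * 5, service_rate=100)

print(normal)                          # backlog stays at zero
print(surge)                           # backlog grows by 400 every tick
print(ticks_to_drain(surge[-1], 100))  # ticks to recover after the surge ends
```

Under these assumed numbers the queue is invisible in normal operation and only becomes the bottleneck when everything tries to come back at once, which is the pattern the AWS summary describes.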
While the recovery time may be tied to the excessive time it took to fail over to a new primary data store, the excruciating slowness with which services were ultimately restored to customers’ customers was almost certainly due exclusively to the inability of the control plane to scale under load.

This is not a new issue. The inability of SDN to ultimately scale in the face of very high loads has been noted by many experts, who cite the overhead such an architecture inserts into networking infrastructure, in conjunction with inadequate response times, as the primary cause of failure to scale.

Traditional load balancing services, both global and local, deal with failure through redundancy and state mirroring. ELB mimics state mirroring through the use of a shared data store, much in the same way applications share state by sharing a data store. The difference is that traditional load balancing services are able to detect and react to failures in sub-second time, whereas a distributed, shared application-based system cannot. In fact, by design, one instance of ELB is unlikely to be aware that another has failed; only the controller of the overarching system is aware of such failures, as it is the primary mechanism through which such failures are addressed. Traditional load balancing services are instantly aware of such failures and enact countermeasures automatically, without being required to wait for customers to move resources from one zone to another to compensate. A traditional load balancing architecture is designed to address this failure automatically; it is one of the primary purposes for which load balancers are designed and used across the globe today.

This difference is not necessarily apparent, or all that important, in day-to-day operations when things are running smoothly. It only rises to the surface in the event of a catastrophic failure, and even then, in a well-architected system, it is not cause for concern but rather relief.
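The local failure handling described above can be sketched as a pool that marks a member down after a few missed health probes and simply stops sending it traffic. This is an illustrative sketch only: the `Member` and `Pool` classes, the threshold of three probes, and the 0.25 second probe interval are all assumptions, not the behavior of any particular product.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    failures: int = 0
    up: bool = True


@dataclass
class Pool:
    members: list
    fail_threshold: int = 3  # hypothetical: missed probes before marking down
    _next: int = 0

    def record_probe(self, member, ok):
        """Update a member's health from the result of one probe."""
        if ok:
            member.failures = 0
            member.up = True
        else:
            member.failures += 1
            if member.failures >= self.fail_threshold:
                member.up = False

    def pick(self):
        """Round-robin across members currently marked up."""
        healthy = [m for m in self.members if m.up]
        if not healthy:
            raise RuntimeError("no healthy members")
        member = healthy[self._next % len(healthy)]
        self._next += 1
        return member


def detection_time(probe_interval, fail_threshold):
    """Worst-case seconds of consecutive failed probes before a member is marked down."""
    return probe_interval * fail_threshold


pool = Pool([Member("a"), Member("b")])
for _ in range(3):
    pool.record_probe(pool.members[0], ok=False)  # "a" misses three probes

picked = [pool.pick().name for _ in range(4)]     # traffic now goes only to "b"
window = detection_time(0.25, pool.fail_threshold)
```

The point of the sketch is that the decision is made locally by the component that forwards traffic, with no round trip through a central controller or shared queue.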
One can extend the issues with this SDN-like model for load balancing to the L2-3 network services SDN is designed to serve. The same issues with shared queues and a centralized model will be exposed in the event of a catastrophic failure. Excessive requests in the shared queue (or bus) result in the inability of the control plane to adequately scale to meet the demand experienced when the entire network must “come back online” after an outage. Even if the performance of an SDN is acceptable during normal operations, its ability to restore the network after a failure may not be.

It would be unwise to ignore the issues experienced by Amazon because it does not call its ELB architecture SDN. In every sense of the term, it acts like an SDN for L4+, and this outage has exposed a potentially fatal flaw in the architecture that must be addressed moving forward.

LESSON LEARNED: SDN requires that both the control and data planes be architected for failure, and able to respond and scale instantaneously.

Applying ‘Centralized Control, Decentralized Execution’ to Network Architecture
WILS: Virtualization, Clustering, and Disaster Recovery
OpenFlow/SDN Is Not A Silver Bullet For Network Scalability
Summary of the AWS Service Event in the US East Region
After The Storm: Architecting AWS for Reliability
QoS without Context: Good for the Network, Not So Good for the End user
SDN, OpenFlow, and Infrastructure 2.0

Cloud Load Balancing Fu for Developers Helps Avoid Scaling Gotchas
If you don’t know how scaling services work in a cloud environment you may not like the results

One of the benefits of cloud computing, and in particular IaaS (Infrastructure as a Service), is that the infrastructure is, well, a service. It’s abstracted, and that means you don’t need to know a lot about the nitty-gritty details of how it works. Right?

Well, mostly right. While there’s no reason you should need to know how to specifically configure, say, an F5 BIG-IP load balancing solution when deploying an application with GoGrid, you probably should understand the implications of using the provider’s API to scale using that load balancing solution. If you don’t, you may run into a “gotcha” that either leaves you scratching your head or reaching for your credit card.

And don’t think you can sit back and be worry free, oh Amazon Web Services customer, because these “gotchas” aren’t peculiar to GoGrid. Turns out AWS ELB comes with its own set of oddities and, ultimately, may lead many to come to the same conclusion cloud proponents have come to: cloud is really meant to scale stateless applications.

Many of the “problems” developers are running into could be avoided by a combination of more control over the load balancing environment and a basic foundation in load balancing. Not just how load balancing works (most understand that already), but how load balancers work. The problems that are beginning to show themselves aren’t because of how traffic is distributed across application instances, or even a lack of understanding of persistence (you call it affinity or sticky sessions), but because of the way load balancers are configured and interact with the nodes (servers) that make up the pools of resources (application instances) they are managing.

Amazon Makes the Cloud Sticky
Stateless applications may be the long term answer to scalability of applications in the cloud, but until then, we need a solution like sticky sessions (persistence)

Amazon recently introduced “stickiness” to its ELB (Elastic Load Balancing) offering. I’ve written before about “stickiness” (what we’ve called persistence for, oh, nearly ten years now), so I won’t reiterate except to say, “it’s about time.” A description of why sticky sessions are necessary was offered in the AWS blog announcing the new feature:

Up until now each Load balancer had the freedom to forward each incoming HTTP or TCP request to any of the EC2 instances under its purview. This resulted in a reasonably even load on each instance, but it also meant that each instance would have to retrieve, manipulate, and store session data for each request without any possible benefit from locality of reference.
-- New Elastic Load Balancing Feature: Sticky Sessions

What the author is really trying to say is that without “sticky sessions” ELB breaks applications because it does not honor state. Remember that most web applications today rely upon state (session) to store quite a bit of application and user specific data that’s necessary for the application to behave properly. When a load balancer distributes requests across instances without consideration for where that state (session) is stored, the application behavior can become erratic and unpredictable. Hence the need for “stickiness”.
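A small sketch shows why state-unaware distribution breaks a session-dependent application. It assumes instances that keep session data in local memory and a load balancer that, when stickiness is enabled, pins a client to an instance with a routing cookie. The class names and the `lb_instance` cookie are hypothetical and are not ELB’s actual mechanism.

```python
from itertools import cycle


class Instance:
    """An application instance that keeps session state in local memory."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}

    def handle(self, session_id, item=None):
        """Add an item to this session's cart and return the cart as seen here."""
        cart = self.sessions.setdefault(session_id, [])
        if item is not None:
            cart.append(item)
        return list(cart)


class LoadBalancer:
    def __init__(self, instances, sticky):
        self.instances = {i.name: i for i in instances}
        self._round_robin = cycle(instances)
        self.sticky = sticky

    def route(self, cookies):
        """Pick an instance; with stickiness, honor and set a routing cookie."""
        name = cookies.get("lb_instance") if self.sticky else None
        instance = self.instances.get(name) or next(self._round_robin)
        if self.sticky:
            cookies["lb_instance"] = instance.name  # hypothetical cookie name
        return instance


def shop(sticky):
    """Send two requests from one client and return the cart after the second."""
    lb = LoadBalancer([Instance("i-1"), Instance("i-2")], sticky=sticky)
    cookies = {}
    cart = []
    for item in ("book", "pen"):
        cart = lb.route(cookies).handle("session-42", item)
    return cart


print(shop(sticky=False))  # second request lands on i-2, which never saw "book"
print(shop(sticky=True))   # both requests land on i-1, so the cart is intact
```

Without stickiness the second request is answered by an instance that has no record of the first, which is the erratic behavior described above; with it, the cart survives because both requests reach the same instance.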