Load Balancing Fu: Beware the Algorithm and Sticky Sessions
The choice of load balancing algorithms can directly impact – for good or ill – the performance, behavior and capacity of applications. Beware making incompatible choices in architecture and algorithms. One of the most persistent issues encountered when deploying applications in scalable architectures involves sessions and the need for persistence-based (a.k.a. sticky) load balancing services to maintain state for the duration of an end-user’s session. It is common enough that even the rudimentary load balancing services offered by cloud computing providers such as Amazon include the option to enable persistence-based load balancing. While the use of persistence addresses the problem of maintaining session state, it introduces other operational issues that must also be addressed to ensure consistent operational behavior of load balancing services. In particular, the use of the Round Robin load balancing algorithm in conjunction with persistence-based load balancing should be discouraged if not outright disallowed. ROUND ROBIN + PERSISTENCE –> POTENTIALLY UNEQUAL DISTRIBUTION of LOAD When scaling applications there are two primary concerns: concurrent user capacity and performance. These two concerns are interrelated in that as capacity is consumed, performance degrades. This is particularly true of applications storing state as each request requires that the application server perform a lookup to retrieve the user session. The more sessions stored, the longer it takes to find and retrieve the session. The exactly efficiency of such lookups is determined by the underlying storage data structure and algorithm used to search the structure for the appropriate session. If you remember your undergraduate classes in data structures and computing Big (O) you’ll remember that some structures scale more efficiently in terms of performance than do others. The general rule of thumb, however, is that the more data stored, the longer the lookup. Only the amount of degradation is variable based on the efficiency of the algorithms used. Therefore, the more sessions in use on an application server instance, the poorer the performance. This is one of the reasons you want to choose a load balancing algorithm that evenly distributes load across all instances and ultimately why lots of little web servers scaled out offer better performance than a few, scaled up web servers. Now, when you apply persistence to the load balancing equation it essentially interrupts the normal operation of the algorithm, ignoring it. That’s the way it’s supposed to work: the algorithm essentially applies only to requests until a server-side session (state) is established and thereafter (when the session has been created) you want the end-user to interact with the same server to ensure consistent and expected application behavior. For example, consider this solution note for BIG-IP. Note that this is true of all load balancing services: A persistence profile allows a returning client to connect directly to the server to which it last connected. In some cases, assigning a persistence profile to a virtual server can create the appearance that the BIG-IP system is incorrectly distributing more requests to a particular server. However, when you enable a persistence profile for a virtual server, a returning client is allowed to bypass the load balancing method and connect directly to the pool member. As a result, the traffic load across pool members may be uneven, especially if the persistence profile is configured with a high timeout value. -- Causes of Uneven Traffic Distribution Across BIG-IP Pool Members So far so good. The problem with round robin- – and reason I’m picking on Round Robin specifically - is that round robin is pretty, well, dumb in its decision making. It doesn’t factor anything into its decision regarding which instance gets the next request. It’s as simple as “next in line", period. Depending on the number of users and at what point a session is created, this can lead to scenarios in which the majority of sessions are created on just a few instances. The result is a couple of overwhelmed instances (with performance degradations commensurate with the reduction in available resources) and a bunch of barely touched instances. The smaller the pool of instances, the more likely it is that a small number of servers will be disproportionately burdened. Again, lots of little (virtual) web servers scales out more evenly and efficiently than a few big (virtual) web servers. Assuming a pool of similarly-capable instances (RAM and CPU about equal on all) there are other load balancing algorithms that should be considered more appropriate for use in conjunction with persistence-based load balancing configurations. Least connections should provide better distribution, although the assumption that an active connection is equivalent to the number of sessions currently in memory on the application server could prove to be incorrect at some point, leading to the same situation as would be the case with the choice of round robin. It is still a better option, but not an infallible one. Fastest response time is likely a better indicator of capacity as we know that responses times increase along with resource consumption, thus a faster responding instance is likely (but not guaranteed) to have more capacity available. Again, this algorithm in conjunction with persistence is not a panacea. Better options for a load balancing algorithm include those that are application aware; that is, algorithms that can factor into the decision making process the current load on the application instance and thus direct requests toward less burdened instances, resulting in a more even distribution of load across available instances. NON-ALGORITHMIC SOLUTIONS There are also non-algorithmic, i.e. architectural, solutions that can address this issue. DIVIDE and CONQUER In cloud computing environments, where it is less likely to find available algorithms other than industry standard (none of which are application-aware), it may be necessary to approach the problem with a divide and conquer strategy, i.e. lots of little servers. Rather than choosing one or two “large” instances, choose to scale out with four or five “small” instances, thus providing a better (but not guaranteed) statistical chance of load being distributed more evenly across instances. FLANKING STRATEGY If the option is available, an architectural “flanking” strategy that leverages layer 7 load balancing, a.k.a. content/application switching, will also provide better consumptive rates as well as more consistent performance. An architectural strategy of this sort is in line with sharding practices at the data layer in that it separates out by some attribute different kinds of content and serves that content from separate pools. Thus, image or other static content may come from one pool of resources while session-oriented, process intensive dynamic content may come from another pool. This allows different strategies – and algorithms – to be used simultaneously without sacrificing the notion of a single point of entry through which all users interact on the client-side. Regardless of how you choose to address the potential impact on capacity, it is important to recognize the intimate relationship between infrastructure services and applications. A more integrated architectural approach to application delivery can result in a much more efficient and better performing application. Understanding the relationship between delivery services and application performance and capacity can also help improve on operational costs, especially in cloud computing environments that constrain the choices of load balancing algorithms. As always, test early and test often and test under high load if you want to be assured that the load balancing algorithm is suitable to meet your operational and business requirements. WILS: Why Does Load Balancing Improve Application Performance? Load Balancing in a Cloud Infrastructure Scalability Pattern: Sharding Sessions Infrastructure Scalability Pattern: Partition by Function or Type It’s 2am: Do You Know What Algorithm Your Load Balancer is Using? Lots of Little Virtual Web Applications Scale Out Better than Scaling Up Sessions, Sessions Everywhere Choosing a Load Balancing Algorithm Requires DevOps Fu Amazon Makes the Cloud Sticky To Boldly Go Where No Production Application Has Gone Before Cloud Testing: The Next Generation2.3KViews0likes1CommentBack to Basics: The Theory of (Performance) Relativity
#webperf #ado Choice of load balancing algorithms is critical to ensuring consistent and acceptable performance One of the primary reasons folks use a Load balancer is scalability with a secondary driver of maintaining performance. We all know the data exists to prove that "seconds matter" and current users of the web have itchy fingers, ready to head for the competition the microsecond they experience any kind of delay. Similarly, we know that productivity is inherently tied to performance. With more and more critical business functions "webified", the longer it takes to load a page the longer the delay a customer or help desk service representative experiences, reducing the number of calls or customers that can be serviced in any given measurable period. So performance is paramount, I see no reason to persuade you further to come to that conclusion. Ensuring performance then is a vital operational directive. One of the ways operations tries to meet that objective is through load balancing. Distributing load ensures available and can be used to offset any latency introduced by increasing capacity (again, I don't think there's anyone who'll argue against the premise that load and performance degradation are inherently tied together). But just adding a load balancing service isn't enough. The algorithm used to distribute load will invariably impact performance – for better or for worse. Consider the industry standard "fastest response time" algorithm. This algorithm distributes load based on the historical performance of each instance in the pool (farm). On the surface, this seems like a good choice. After all, what you want is the fastest response time, so why not base load balancing decisions on the metric against which you are going to be measured? The answer is simple: "fastest" is relative. With very light load on a pool of, say, three servers, "fastest" might mean sub-second responses. But as load increases and performance decreases, "fastest" might start creeping up into the seconds – if not more. Sure, you're still automagically choosing the fastest of the three servers, but "fastest" is absolutely relative to the measurements of all three servers. Thus, "fastest response time" is probably a poor choice if one of your goals is measured in response time to the ultimate customer – unless you combine it with an upper connection limit. HOW TO USE "FASTEST RESPONSE TIME" ALGORITHMS CORRECTLY One of the negatives of adopting a cloud computing paradigm with a nearly religious-like zeal is that you buy into the notion that utilization is the most important metric in the data center. You simply do not want to be wasting CPU cycles, because that means you're inefficient and not leveraging cloud to its fullest potential. Well, horse-puckey. The reality is that 100% utilization and consistently well-performing applications do not go hand in hand. Period. You can have one, but not the other. You're going to have to choose which is more important a measurement – fast applications or full utilization. In the six years I spent load testing everything from web applications to web application firewalls to load balancers to XML gateways one axiom always, always, remained true: As load increases performance decreases. You're welcome to test and retest and retest again to prove that wrong, but good luck. I've never seen performance increase or even stay the same as utilization approaches 100%. Now, once you accept that reality you can use it to your advantage. You know that performance is going to decrease as load increases, you just don't know at what point the degradation will become unacceptable to your users. So you need to test to find that breaking point. You want to stress the application and measure the degradation, noting the number of concurrent connections at which performance starts to degrade into unacceptable territory. That is your connection limit. Keep track of that limit (for the application, specifically, because not all applications will have the same limits). When you configure your load balancing service you can now select fastest response time but you also need to input hard connection limits on a per-instance basis. This prevents each instance from passing through the load-performance confluence that causes end-users to start calling up the help desk or sighing "the computer is slow" while on the phone with their customers. This means testing. Not once, not twice, but at least three runs. Make sure you've found the right load-performance confluence point and write it down. On your hand, in permanent marker. While cloud computing and virtualization have certainly simplified load balancing services in terms of deployment, it's still up to you to figure out the right settings and configuration options to ensure that your applications are performing with the appropriate levels of "fast". Back to Basics: Load balancing Virtualized Applications Performance in the Cloud: Business Jitter is Bad The “All of the Above” Approach to Improving Application Performance The Three Axioms of Application Delivery Face the facts: Cloud performance isn't always stable Data Center Feng Shui: Architecting for Predictable Performance A Formula for Quantifying Productivity of Web Applications Enterprise Apps are Not Written for Speed313Views0likes0CommentsCloud is not Rocket Science but it is Computer Science
That doesn’t mean it isn’t hard - it means it’s a different kind of hard. For many folks in IT it is likely you might find in their home a wall on which you can find hanging a diploma. It might be a BA, it might be a BS, and you might even find one (or two) “Master of Science” as well. Now interestingly enough, none of the diplomas indicate anything other than the level of education (Bachelor or Master) and the type (Arts or Science). But we all majored in something, and for many of the people who end up in IT that something was Computer Science. There was not, after all, an option to earn a “MS of Application Development” or a “BS of Devops”. While many higher education institutions offer students the opportunity to emphasize in a particular sub-field of Computer Science, that’s not what the final degree is in. It’s almost always a derivation of Computer Science. Yet when someone asks – anyone, regardless of technological competency – you what you do, you don’t reply “I’m a computer scientist.” You reply “I’m a sysadmin” or “I’m a network architect” or “I’m a Technical Marketing Manager” (which in the technological mecca of the midwest that is Green Bay gets some very confused expressions in response). We don’t describe ourselves as “computer scientists” even though by education that’s what we are. And what we practice is, no matter what our focus is at the moment, computer science. The scripts, the languages, the compilers, the technology – they’re just tools. They’re a means to an end. CLOUD is COMPUTER SCIENCE The definition of computer science includes the word “computer” as a means to limit the field of study. It is not intended to limit the field to a focus on computers and ultimately we should probably call it computing science because that would free us from the artificial focus on specific computing components. Computer science or computing science (sometimes abbreviated CS) is the study of the theoretical foundations of information and computation, and of practical techniques for their implementation and application in computer systems. It is frequently described as the systematic study of algorithmic processes that create, describe, and transform information. [emphasis added] -- Wikipedia, “Computer Science” Interestingly enough, Christofer Hoff recently made this same observation in perhaps a more roundabout but absolutely valid arrangement of words: Cloud is only rocket science if you’re NASA and using the Cloud for rocket science. Else, for the rest of us, it’s an awesome platform upon which we leverage various opportunities to improve the way in which we think about and implement the practices and technology needed to secure the things that matter most to us. [emphasis added] /Hoff -- Hoff’s 5 Rules Of Cloud Security… Hoff is speaking specifically to security, but you could just as easily replace “secure” with “deliver” or “integrate” or “automate”. It’s not about the platform, it’s the way in which we leverage and think about and implement solutions. It is, at its core, about an architecture; a way of manipulating data and delivering it to a person so that it can become information. It’s computing science, the way in which we combine and apply compute resources – whether network or storage or server – to solve a particular (business) problem. That is, in a nutshell, the core of what cloud computing really is. It’s “computer science” with an focus on architecting a system by which the computing resources necessary to secure, optimize, and deliver applications can be achieved most efficiently. COMPONENTS != CLOUD Virtualization, load balancing, and server time-sharing are not original concepts. Nor are the myriad infrastructure components that make up the network and application delivery network and the storage network. Even most of the challenges are not really “new”, they’re just instantiations of existing challenges (integration, configuration management, automation, and IP address management) that are made larger and more complex by the sheer volume of systems being virtualized, connected, and networked. What is new are the systems and architectures that tie these disparate technologies together to form a cohesive operating environment in which self-service IT is a possibility and the costs of managing all those components are much reduced. Cloud is about the infrastructure and how the rest of the infrastructure and applications collaborate, integrate, and interact with the ecosystem in order to deliver applications that are available, fast, secure, efficient, and affordable. The cost efficiency of cloud comes from its multi-tenant model – sharing resources. The operational efficiency, however, comes from the integration and collaborative nature of its underlying infrastructure. It is the operational aspects of cloud computing that make self-service IT possible, that enable a point-and-click provisioning of services to be possible. That infrastructure is comprised of common components that are not new or unfamiliar to IT, but the way in which it interacts and collaborates with its downstream and upstream components is new. That’s the secret sauce, the computer science of cloud. WHY is THIS IMPORTANT to REMEMBER It is easy to forget that the networks and application architectures that make up a data center or a cloud are founded upon the basics we learned from computer science. We talk about things like “load balancing algorithms” and choosing the best one to meet business or technical needs, but we don’t really consider what that means that the configuration decisions we’re making are ultimately making a choice between well-known and broadly applicable algorithms some of which carry very real availability and performance implications. When we try to automate capacity planning (elastic scalability) we’re really talking about codifying a decision problem in algorithmic fashion. We may not necessarily use formal statements and proofs to explain the choices for one algorithm or another, or the choice to design a system this way instead of that, but that formality and the analysis of our choices is something that’s been going on, albeit perhaps subconsciously. The phrase “it isn’t rocket science” is generally used to imply that “it” isn’t difficult or requiring of special skills. Cloud is not rocket science, but it is computer science, and it will be necessary to dive back into some of the core concepts associated with computer science in order to design the core systems and architectures that as a whole are called “cloud”. We (meaning you) are going to have to make some decisions, and many of them will be impacted – whether consciously or not – by the core foundational concepts of computer science. Recognizing this can do a lot to avoid the headaches of trying to solve problems that are, well, unsolvable and point you in the direction of existing solutions that will serve well as you continue down the path to dynamic infrastructure maturity, a.k.a. cloud. Related Posts219Views0likes1CommentChoosing a Load Balancing Algorithm Requires DevOps Fu
Knowing the algorithms is only half the battle, you’ve got to understand a whole lot more to design a scalable architecture. Citrix’s Craig Ellrod has a series of blog posts on the basic (industry standard) load balancing algorithms. These are great little posts for understanding the basics of load balancing algorithms like round robin, least connections, and least (fastest) response time. Craig’s posts are accurate in their description of the theoretical (designed) behavior of the algorithms. The thing that’s missing from these posts (and maybe Craig will get to this eventually) is context. Not the context I usually talk about, but the context of the application that is being load balanced and the way in which modern load balancing solutions behave, which is not as a simple Load balancer. Different applications have different usage patterns. Some are connection heavy, some are compute intense, some return lots of little responses while others process larger incoming requests. Choosing a load balancing algorithm without understand the behavior of the application and its users will almost always result in an inefficient scaling strategy that leaves you constantly trying to figure out why load is unevenly distributed or SLAs are not being met or why some users are having a less than stellar user experience. One of the most misunderstood aspects of load balancing is that a load balancing algorithm is designed to choose from a pool of resources and that an application is or can be made up of multiple pools of resources. These pools can be distributed (cloud balancing) or localized, and they may all be active or some may be designated as solely existing for failover purposes. Ultimately this means that the algorithm does not actually choose the pool from which a resource will be chosen – it only chooses a specific resource. The relationship between the choice of a pool and the choice of a single resource in a pool is subtle but important when making architectural decisions – especially those that impact scalability. A pool of resources is (or should be) a set of servers serving similar resources. For example, the separation of image servers from application logic servers will better enable scalability domains in which each resource can be scaled individually, without negatively impacting the entire application. To make the decision more complex, there are a variety of factors that impact the way in which a load balancing algorithm actually behaves in contrast to how it is designed to act. The most basic of these factors is the network layer at which the load balancing decision is being made. THE IMPACT of PROTOCOL There are two layers at which applications are commonly load balanced: TCP (transport) and HTTP (application). The layer at which a load balancing is performed has a profound effect on the architecture and capabilities. Layer 4 (Connection-oriented) Load Balancing When load balancing at layer 4 you are really load balancing at the TCP or connection layer. This means that a connection (user) is bound to the server chosen on the initial request. Basically a request arrives at the load balancer and, based on the algorithm chosen, is directed to a server. Subsequent requests over the same connection will be directed to that same server. Unlike Layer 7-related load balancing, in a layer 4 configuration the algorithm is usually the routing mechanism for the application. Layer 7 (Connection-oriented) Load Balancing When load balancing at Layer 7 in a connection-oriented configuration, each connection is treated in a manner similar to Layer 4 Load Balancing with the exception of how the initial server is chosen. The decision may be based on an HTTP header value or cookie instead of simply relying on the load balancing algorithm. In this scenario the decision being made is which pool of resources to send the connection to. Subsequently the load balancing algorithm chosen will determine which resource within that pool will be assigned. Subsequent requests over that same connection will be directed to the server chosen. Layer 7 connection-oriented load balancing is most useful in a virtual hosting scenario in which many hosts (or applications) resolve to the same IP address. Layer 7 (Message-oriented) Load Balancing Layer 7 load balancing in a message-oriented configuration is the most flexible in terms of the ability to distribute load across pools of resources based on a wide variety of variables. This can be as simple as the URI or as complex as the value of a specific XML element within the application message. Layer 7 message-oriented load balancing is more complex than its Layer 7 connection-oriented cousin because a message-oriented configuration also allows individual requests – over the same connection – to be load balanced to different (virtual | physical) servers. This flexibility allows the scaling of message-oriented protocols such as SIP that leverage a single, long-lived connection to perform tasks requiring different applications. This makes message-oriented load balancing a better fit for applications that also provide APIs as individual API requests can be directed to different pools based on the functionality they are performing, such as separating out requests that update a data source from those that simply read from a data source. Layer 7 message-oriented load balancing is also known as “request switching” because it is capable of making routing decisions on every request even if the requests are sent over the same connection. Message-oriented load balancing requires a full proxy architecture as it must be the end-point to the client in order to intercept and interpret requests and then route them appropriately. OTHER FACTORS to CONSIDER If that were not confusing enough, there are several other factors to consider that will impact the way in which load is actually distributed (as opposed to the theoretical behavior based on the algorithm chosen). Application Protocol HTTP 1.0 acts differently than HTTP 1.1. Primarily the difference is that HTTP 1.0 implies a one-to-one relationship between a request and a connection. Unless the HTTP Keep-Alive header is included with a HTTP 1.0 request, each request will incur the processing costs to open and close the connection. This has an impact on the choice of algorithm because there will be many more connections open and in progress at any given time. You might think there’s never a good reason to force HTTP 1.0 if the default is more efficient, but consider the case in which a request is for an image – you want to GET it and then you’re finished. Using HTTP 1.0 and immediately closing the connection is actually better for the efficiency and thus capacity of the web server because it does not maintain an open connection waiting for a second or third request that will not be forthcoming. An open, idle connection that will eventually simply time-out wastes resources that could be used by some other user. Connection management is a large portion of resource consumption on a web or application server, thus anything that increases that number and rate decreases the overall capacity of each individual server, making it less efficient. HTTP 1.1 is standard (though unfortunately not ubiquitous) and re-uses client-initiated connections, making it more resource efficient but introducing questions regarding architecture and algorithmic choices. Obviously if a connection is reused to request both an image resource and a data resource but you want to leverage fault tolerant and more efficient scale design using scalability domains this will impact your choice of load balancing algorithms and the way in which requests are routed to designated pools. Configuration Settings A little referenced configuration setting in all web and application servers (and load balancers) is the maximum number of requests per connection. This impacts the distribution of requests because once the configured maximum number of requests has been sent over the same connection it will be closed and a new connection opened. This is particularly impactful on AJAX and other long-lived connection applications. The choice of algorithm can have a profound affect on availability when coupled with this setting. For example, an application uses an AJAX-update on a regular interval. Furthermore, the application requests made by that updating request require access to session state, which implies some sort of persistence is required. The first request determines the server, and a cookie (as one method) subsequently ensures that all further requests for that update are sent to the same server. Upon reaching the maximum number of requests, the load balancer must re-establish that connection – and because of the reliance on session state it must send the request back to the original server. In the meantime, however, another user has requested a resource and been assigned to that same server, and that user has caused the server to reach its maximum number of connections (also configurable). The original user is unable to be reconnected to his session and regardless of whether the request is load balanced to a new resource or not, “availability” is effectively lost as the current state of the user’s “space” is now gone. You’ll note that a combination of factors are at work here: (1) the load balancing algorithm chosen, (2) configuration options and (3) application usage patterns and behavior. IT JUST ISN’T ENOUGH to KNOW HOW the ALGORITHMS BEHAVE This is the area in which devops is meant to shine – in the bridging of that gap across applications and the network, in understanding how the two interact, integrate, and work together to achieve high-scalability and well-performing applications. Devops must understand the application but they must also understand how load balancing choices and configurations and options impact the delivery and scalability of that application, and vice-versa. The behavior of an application and its users can impact the way in which a load balancing algorithm performs in reality rather than theory, and the results are often very surprising. Round-robin load balancing, is designed to equally distribute requests across a pool of resources, not workload. Thus in the course of a period of time one application instance in a pool of resources may become overloaded and either become unavailable or exhibit errors that appear only under heavy load while other instances in the same pool may be operating normally and under nominal load conditions. You certainly need to have a basic understanding of load balancing algorithms, especially moving forward toward elastic applications and cloud computing . But once you understand the basics you really need to start examining the bigger picture and determining the best set of options and configurations for each application. This is one of the reasons testing is so important when designing a scalable architecture – to uncover the strange behavior that can often only be discovered under heavy load, to examine the interaction of all the variables and how they impact the actual behavior of the load balancer at run time.344Views0likes0CommentsThe Impossibility of CAP and Cloud
It comes down to this: the on-demand provisioning and elastic scalability systems that make up “cloud” are addressing NP-Complete problems for which there is no known exact solutions. At the heart of what cloud computing provides – in addition to compute-on-demand – is the concept of elastic scalability. It is through the ability to rapidly provision resources and applications that we can achieve elastic scalability and, one assumes, through that high availability of systems. Obviously, given my relationship to F5 I am strongly interested in availability. It is, after all, at the heart of what an application delivery controller is designed to provide. So when a theorem is presented that basically says you cannot build a system that is Consistent, Available, and Partition-Tolerant I get a bit twitchy. Just about the same time that Rich Miller was reminding me of Brewer’s CAP Theorem someone from HP Labs claimed to have solved the P ≠ NP problem (shortly thereafter determined to not be a solution after all), which got me thinking about NP-Completeness in problem sets, of which solving the problem of creating a distributed CAP-compliant system certainly appears to be a member. CLOUD RESOURCE PROVISIONING is NP-COMPLETE A core conflict with cloud and CAP-compliance is on-demand provisioning. There are, after all, a minimal set of resources available (cloud is not infinitely scalable, after all) with, one assumes, each resource having a variable amount of compute availability. For example, most cloud providers use a “large”, “medium”, and “small” sizing approach to “instances” (which are, in almost all cases, a virtual machine). Each “size” has a defined set of reserved compute (RAM and CPU) for use. Customers of cloud providers provision instances by size. At first glance this should not a problem. The provisioning system is given an instruction, i.e. “provision instance type X.” The problem begins when you consider what happens next – the provisioning system must find a hardware resource with enough capacity available on which to launch the instance. In theory this certainly appears to be a variation of the Bin packing problem (which is NP-complete). It is (one hopes) resolved by the cloud provider by removing the variability of location (parameterization) or the use of approximation (using the greedy approximation algorithm “first-fit”, for example). In a pure on-demand provisioning environment, the management system would search out, in real-time, a physical server with enough physical resources available to support the requested instance requirements but it would also try to do so in a way that minimizes the utilization of physical resources on each machine so as to better guarantee availability of future requests and to be more efficient (and thus cost-effective). Brewer’s CAP Theorem It is impractical, of course, to query each physical server in real-time to determine an appropriate location, so no doubt there is a centralized “inventory” of resources available that is updated upon the successful provisioning of an instance. Note that this does not avoid the problem of NP-Completeness and the resulting lack of a solution as data replication/synchronization is also an NP-Complete problem. Now, because variability in size and an inefficient provisioning algorithm could result in a fruitless search, providers might (probably do) partition each machine based on the instance sizes available and the capacity of the machine. You’ll note that most providers size instances as multiples of the smallest, if you were looking for anecdotal evidence of this. If a large instance is 16GB RAM and 4 CPUs, then a physical server with 32 GB of RAM and 8 CPUs can support exactly two large instances. If a small instance is 4GB RAM and 1 CPU, that same server could ostensibly support a combination of both: 8 small instances or 4 small instances and 2 large instances, etc… However, that would make it difficult to keep track of the availability of resources based on instance size and would eventually result in a failure of capacity availability (which makes the system non-CAP compliant). However, not restricting the instances that can be deployed on a physical server returns us to a bin packing-like algorithm that is NP-complete which necessarily introduces unknown latency that could impact availability. This method also introduces the possibility that while searching for an appropriate location some other consumer has requested an instance that is provisioned on a server that could have supported the first consumer’s request, which results in a failure to achieve CAP-compliance by violating the consistency constraint (and likely the availability constraint, as well). The provisioning will never be “perfect” because there is no exact solution to an NP-complete problem. That means the solution is basically the fastest/best it can be given the constraints. Which we often distill down to “good enough.” That means that there are cases where either availability or consistency will be violated, making cloud in general non-CAP compliant. The core conflict is the definition of “highly available” as “working with minimal latency.” Or perhaps the real issue is the definition of “minimal”. For it is certainly the case that a management system that leverages opportunistic locking and shared data systems could alleviate the problem of consistency, but never availability. Eliminating the consistency problem by ensuring that every request has exclusive access to the “database” of instances when searching for an appropriate physical location introduces latency while others wait. This is the “good enough” solution used by CPU schedulers – the CPU scheduler is the one and only authority for CPU time-slice management. It works more than well-enough on a per-machine basis, but this is not scalable and in larger systems would result in essentially higher rates of non-availability as the number of requests grows. WHY SHOULD YOU CARE Resource provisioning and job scheduling in general are in the class of NP-complete problems. While the decision problem to choose an appropriate physical server on which to launch a set of requested instances can be considered an instantiation of the Bin packing problem, it can also be viewed as a generalized assignment problem or, depending on the parameters, a variation of the Knapsack problem, or any one of the multiprocessor scheduling problems, all of which are NP-complete. Cloud is essentially the integration of systems that provide resource provisioning and may include job scheduling as a means to automate provisioning and enable a self-service environment. Because of its reliance on problems that are NP-complete we can deduce that cloud is NP-complete. NOTE: No, I’m not going to provide a formal proof. I will leave that to someone with a better handle on the reductions necessary to prove (or disprove) that the algorithms driving cloud are either the same or derivations of existing NP-Complete problem sets. The question “why should I care if these problems are NP-Complete” is asked by just about every student in every algorithms class in every university there is. The answer is always the same: because if you can recognize that a problem you are trying to solve is NP-Complete you will not waste your time trying to solve a problem that thousands of mathematicians and computer scientists have been trying to solve for 50 years and have thus far not been able to do so. And if you do solve it, you might want to consider formalizing it, because you’ve just proved P = NP and there’s a $1,000,000 bounty out on that proof. But generally speaking, it’s a good idea to recognize them when you see them because you can avoid a lot of frustration by accepting up front you can’t solve it, and you can also leverage existing research / algorithms that have been proposed as alternatives (approximation algorithms, heuristics, parameterized algorithms, etc…) to get the “best possible” answer and get on with more important things. It also means there is no one optimal solution to “cloud”, only a variety of “good enough” or “approximately optimal” solutions. Neither the time required to provision can be consistently guaranteed or the availability of resources in a public cloud environment. This is, essentially, why the concept of reserved instances exists. Because if your priorities include high availability, you’d better consider budgeting for reserved instances, which is basically a more cost effective method of having a whole bunch of physical servers in your pool of available resources on stand-by. But if your priorities are geared toward pinching of pennies, and availability is lower on your “must have” list of requirements, then reserving instances is an unnecessary cost – as long as you’re willing to accept the possibility of lower availability. Basically, the impossibility of achieving CAP in cloud impacts (or should impact) your cloud computing strategy – whether you’re implementing locally or leveraging public resources. As I mentioned very recently – cloud is computer science, and if you understand the underlying foundations of the systems driving cloud you will be much better able to make strategic decisions regarding when and what type of cloud is appropriate and for what applications. Related Posts319Views0likes1CommentNot all application requests are created equal
ArsTechnica has an interesting little article on what Windows Azure is and is not. During the course of discussion with Steven Martin, Microsoft's senior director of Developer Platform Product Management, a fascinating – or disturbing in my opinion – statement was made: There is a distinction between the hosting world and the cloud world that Martin wanted to underline. Whereas hosting means simply the purchase of space under certain conditions (as opposed to buying the actual hardware), the cloud completely hides all issues of clustering and/or load balancing, and it offers an entirely virtualized instance that takes care of all your application's needs. [emphasis added] The reason this is disturbing is because not all application requests are created equal and therefore should not necessarily be handled in the same way by a “clustering and/or load balancing solution”. But that’s exactly what hiding clustering and/or load balancing ends up doing. While it’s nice that the nitty-gritty details are obscured in the cloud from developers and, in most cases today, the administrators as well, the lack of control over how application requests are distributed actually makes the cloud and its automatic scalability (elasticity) less effective. To understand why you need a bit of background regarding industry standard load balancing algorithms. In the beginning there was Round Robin, an algorithm that is completely application agnostic and simply distributes request based on a list of servers, one after the other. If there are five servers in a pool/farm/cluster, then each one gets a turn. It’s an egalitarian algorithm that treats all servers and all requests the same. Round Robin achieves availability, but often at the cost of application performance. When application performance became an issue we got new algorithms like Least Connections and Fastest Response Time. These algorithms tried to take into account the load on the servers in the pool/farm/cluster before making a decision, and could therefore better improve utilization such that application performance started getting better. But these algorithms only consider the server and its load, and don’t take into consideration the actual request itself. And therein lies the problem, for not all requests are created equal. A request for an image requires X processing on a server and Y memory and is usually consistent across time and users. But a request that actually invokes application logic and perhaps executes a database query is variable in its processing time and memory utilization. Some may take longer than others, and require more memory than others. Each request is a unique snowflake whose characteristics are determined by user, by resource, and by the conditions that exist at the time it was made. It turns out the problem is that in order to effectively determine how to load balance requests in a way that optimizes utilization on servers and offers the best application performance you actually have to understand the request. That epiphany gave rise to layer 7 load balancing and the ability to exact finer-grained control over load balancing. Between understanding the request and digging deeper into the server – understanding CPU utilization, memory, network capacity – load balancers were suddenly very effective at distributing load in a way that made sense on a per request basis. The result was better architectures, better performing applications, and better overall utilization of the resources available. Now comes the cloud and its “we hide all the dirty infrastructure details from you” mantra. The problem with this approach is simple: a generic load balancing algorithm is not the most effective method of distributing load across servers, but a cloud provider is not prescient and therefore has no idea what algorithm might be best for your application. Therefore the provider has very little choice in which algorithm is used for load balancing and therefore any choice made will certainly provide availability, but will likely not be the most effective for your specific application. So while it may sound nice that all the dirty details of load balancing and clustering is “taken care of for you” in the cloud, it’s actually doing you and your application a disservice. Hiding the load balancing and/or clustering capabilities of the cloud, in this case Azure, from the developer is not necessarily the bonus Martin portrays it to be. The ability to control how requests are distributed is just as important in the cloud as it is in your own data center. As Gartner analyst Daryl Plummer points out, underutilizing resources in the cloud, as may happen when using simplistic load balancing algorithms, can be as expensive as running your own data center and may negatively impact application performance. Without some input into the configuration of load balancers and other relevant infrastructure, there isn’t much you can do about that, either, but start up another instance and hope that horizontal scalability will improve performance – at the expense of your budget. Remember that when someone else makes decisions for you that you are necessarily giving up control. That’s not always a bad thing. But it’s important for you to understand what you are giving up before you hand over the reins. So do your research. You may not have direct control, but you can ask about the “clustering and/or load balancing” provided and understand what affect that may – or may not – have on the performance of your application and the effectiveness of the utilization of the resources for which you are paying.228Views0likes2CommentsIt’s 2am: Do You Know What Algorithm Your Load Balancer is Using?
The wrong load balancing algorithm can be detrimental to the performance and scalability of your web applications. When you’re mixing and matching virtual or physical servers you need to take care with how you configure your Load balancer – and that includes cloud-based load balancing services. Load balancers do not at this time, unsurprisingly, magically choose the right algorithm for distributing requests for a given environment. One of the nice things about a load balancing solution that comes replete with application-specific templates is that all the work required to determine the optimal configuration for the load balancer and its associated functionality (web application security, acceleration, optimization) has already been done – including the choice of the right algorithm for that application. But for most applications there are no such templates, no guidance, nothing. Making things more difficult are heterogeneous environments in which the compute resources available vary from instance to instance. These variations make some load balancing algorithms unsuited to such environments. There is some general guidance you can use when trying to determine which algorithm is best suited to meeting the performance and scalability needs of your applications based on an understanding of how the algorithms are designed to make decisions, but if you want optimal performance and scalability you’ll ultimately have to do some testing. Heterogeneous environments can pose a challenge to scale if careful consideration of load balancing algorithms is not taken. Whether the limitations on compute resources are imposed by a virtualization solution or the hardware itself, limitations that vary from application instance to application instance are an important factor to consider when configuring your load balancing solution. Let’s say you’ve got a pool of application instances and you know the capacity of each in terms of connections (X). Two of the servers can handle 500 concurrent connections, and one can handle 1000 concurrent connections. Now assume that your load balancer is configured to perform standardround robin load balancingbetween the three instances. Even though the total capacity of these three servers appears to be 2000 concurrent connections, by the time you hit 1501, the first of the three servers will be over capacity because it will have to try to handle 501 connections. If you tweak the configuration just a bit to indicate the maximum connection capacity for each node (instance) you can probably avoid this situation, but there are no guarantees. Now let’s make a small change in the algorithm – instead of standard round robin we’ll useweightedround robin (often called “ratio”), and give the largest capacity server a higher weight based on its capacity ratio to the other servers, say 2. This means the “bigger” server will receive twice as many requests as the other two servers, which brings the total capacity closer to what is expected. You might be thinking that aleast connection algorithmwould be more appropriate in a heterogeneous environment, but that’s not the case. Least connection algorithms base distribution upon the number of connections currently open on any given server instance; it does not necessarily take into consideration the maximum connection capacity for that particular node. Fastest response time combined with per node connection limits would be a better option, but afastest response time algorithmtends to result in a very unequal distribution as load increases in a heterogeneous environment. This does not, however, say anything about the performance of the application when using any of the aforementioned algorithms. We do know that as application instances near capacity performance tends to degrade. Thus we could extrapolate that the performance for the two “smaller” servers will degrade faster than the performance for the bigger server because they will certainly reach capacity under high load before the larger server instance – when using some algorithms, at least. Algorithms like fastest response time and least connections tend to favor higher performing servers which means in the face of a sudden spike of traffic performance may degrade usingthatalgorithm as well. How about more “dynamic” algorithms that take into consideration multiple factors? Dynamic load balancing methods are designed to work with servers that differ in processing speed and memory. The resulting load balancing decisions may be uneven in terms of distribution but generally provides a more consistent user experience in terms of performance. For example, theobserved dynamic load balancing algorithmdistributes connections across applications based on a ratio calculated every second, andpredictive dynamic load balancinguses the same ratio but also takes into consideration the change between previous connection counts and current connection counts and adjusts the ratio based on the delta. Predictive mode is more aggressive in adjusting ratio values for individual application instances based on connection changes in real-time and in a heterogeneous environment is likely better able to handle the differences between server capabilities. What is TCP multiplexing? TCP multiplexing is a technique used primarily by load balancers and application delivery controllers (but also by some stand-alone web application acceleration solutions) that enables the device to "reuse" existing TCP connections. This is similar to the way in which persistent HTTP 1.1 connections work in that a single HTTP connection can be used to retrieve multiple objects, thus reducing the impact of TCP overhead on application performance. TCP multiplexing allows the same thing to happen for TCP-based applications (usually HTTP / web) except that instead of the reuse being limited to only 1 client, the connections can be reused over many clients, resulting in much greater efficiency of web servers and faster performing applications. Interestingly enough, chatting withDan Bartow(now CloudTest Evangelist and Vice President atSOASTA) about his experiences as Senior Manager of Performance Engineering atIntuit, revealed that testing different algorithms under heavy load generated externally finally led them to the discovery that a simple round robin algorithm combined with the application ofTCP multiplexing optionsyielded a huge boost in both capacity and performance. But that was only after testing under conditions which were similar to those the applications would experience during peaks in usage and normalization of the server environment. This illustrates well that performance and availability isn’t simply a matter of dumping a load balancing solution into the mix – it’s important to test, to tweak configurations, and test again to find the overall infrastructure configuration that’s going to provide the best application performance (and thus end-user experience) while maximizing resource utilization. Theoretical mathematically accurate models of load balancing are all well and good, but in the real world the complexity of the variables and interaction between infrastructure solutions and applications and servers is much higher, rendering the “theory” just that – theory. Invariably which load balancing algorithm is right for your application is going to depend heavily on what metrics are most important to you. A balance of server efficiency, response time, and availability is likely involved, but which one of these key metrics is most important depends on what business stakeholders have deemed most important to them. The only way to really determine which load balancing algorithm will achieve the results you are looking for is totest them, under load, and observe the distribution and performance of the application. FIRE and FORGET NOT a GOOD STRATEGY The worst thing you can do is “fire and forget” about your load balancer. The algorithm that might be right for one application might not be right for another, depending on the style of application, its usage patterns, the servers used to serve it, and even the time of year. Unfortunately we’re not quite at the point where the load balancer can automatically determine the right load balancing algorithm for you, butthere are ways to adjust – dynamically – the algorithm based on not just the application but also the capabilities of the servers(physical and/or virtual) being load balanced so one day it is quite possible that through the magic of Infrastructure 2.0, load balancing algorithms will be modified on-demand based on the type of servers that make up the pool of resources. In order for the level of sophistication we’d (all) like to see, however, it’s necessary to first understand the impact of the load balancing algorithm on applications and determine which one is best able to meet the service level agreements in various environments based on a variety of parameters. This will become more important as public and private cloud computing environments are leveraged in new ways and introduce more heterogeneous environments. Seasonal demand might, for example, be met by leveraging different “sizes” of unused capacity across multiple servers in the data center. These “servers” would likely be of different CPU and RAM capabilities and thus would certainly be impacted by the choice of load balancing algorithm. Being able todynamically modify the load balancing algorithmbased on the capacities of application instances is an invaluable tool when attempting to maximize the efficiency of resources while minimizing associated costs. There is, of course, a lack of control over algorithms in cloud computing environments, as well, that make the situation more difficult. With a limited set of choices available from providers the algorithm that’s best for your application and server resource composition may not be available. Providers need to make it easier for customers to take advantage of modern, application and resource-aware algorithms that have evolved through trial-and-error over the past decade. Again, Infrastructure 2.0 enables this level of choice but must be leveraged by the provider to extend that choice and control to its customers. For now, it’s going to have to be enough to (1) thoroughly test the application and its supporting infrastructure under load and (2) adjust the load balancing algorithm to meet your specific performance criteria based on what is available. You might be surprised to find how much better your response time and capacity can be when you’re using the “right” load balancing algorithm for your application – or at least one that’s more right than it is wrong if you’re in a cloud computing environment.298Views0likes2CommentsImpact of Load Balancing on SOAPy and RESTful Applications
A load balancing algorithm can make or break your application’s performance and availability It is a (wrong) belief that “users” of cloud computing and before that “users” of corporate data center infrastructure didn’t need to understand any of that infrastructure. Caution: proceed with infrastructure ignorance at the (very real) risk of your application’s performance and availability. Think I’m kidding? Stefan’s SOA & Enterprise Architecture Blog has a detailed and very explanatory post on Load Balancing Strategies for SOA Infrastructures that may change your mind. This post grew, apparently, out of some (perceived) bad behavior on the part of a load balancer in a SOA infrastructure. Specifically, the load balancer configuration was overwhelming the very services it was supposed to be load balancing. Before we completely blame the load balancer, Stefan goes on to explain that the root of the problem lay in the load balancing algorithm used to distribute requests across the services. Specifically, the load balancer was configured to use a static round robin algorithm and to apply source IP address-based affinity (persistence) while doing so. The result is that one instance of the service was constantly sent requests while the others remained idle and available. Stefan explains how the load balancing algorithm was changed to utilize a dynamic ratio algorithm that takes into consideration the state of each service (CPU and memory available) and removed the server affinity requirement. The problem wasn’t the load balancer, per se. The load balancer was acting exactly as it was configured to act. The problem lay deeper: in understanding the interaction between the network, the application network, and the services themselves. Services, particularly stateless services as offered by SOA and REST-based APIs today, do not generally require persistence. In cases where they do require persistence, that persistence needs to be based on application-layer information, such as an API key or user (usually available in a cookie). But this problem isn’t unique to SOA. Consider, if you will, the effect that such an unaware distribution might have on any one of the popular social networking sites offering RESTful APIs for integration. Imagine that all Twitter API requests ended up distributed to one server in Twitter’s infrastructure. It would fall over quickly, no doubt about that, because the requests are distributed without any consideration for current load and almost, one could say, blindly. Stefan points this out as he continues to examine the effect of load balancing algorithms on his SOA infrastructure: “Secondly, the static round-robin algorithm does not take in effect, which state each cluster node has. So, for example if one cluster node is heavily under load, because it processes some complex orders, and this results in 100% cpu load, then the load balancer will not recognize this but route lots of other requests to this node causing overload and saturation.” Load balancing algorithms that do not take into account the current state of the server and application, i.e. they are not context-aware, are not appropriate for today’s dynamic application architectures. Such algorithms are static, brittle, and blind when it comes to distributed load efficiently and will ultimately result in an uneven request load that is likely to drive an application to downtime. THE APPLICATION SHOULD BE A PART OF THE ALGORITHM It is imperative in a distributed application architecture like SOA or REST that the application network infrastructure, i.e. the load balancer, be able to take into consideration the current load on any given server before distributing a request. If one node in the (pool|farm|cluster) is processing a complex order that consumes most of the CPU resources available, the load balancer should not continue to send it requests. This requires that the load balancer, the application delivery controller, be aware of the application, its environment, as well as the network and the user. It must be able to make a decision, in real-time, about where to direct any given request based on all the variables available. That includes CPU resources, what the request is, and even who the user/application is. For example, Twitter uses a system of inbound rate limiting on API calls to help manage the load on its infrastructure. Part of that equation could be the calling application. HTTP as a transport protocol contains a somewhat surprisingly rich array of information in its headers that can be parsed and inspected and made a part of the load balancing equation in any environment. This is particularly useful to sites like Twitter where multiple “applications” (clients) are making use of the API. Twitter can easily require the use of a custom HTTP header that includes the application name and utilize that as part of its decision making processes. Like RESTful APIs, SOAP envelopes are full of application specifics that provide data to the load balancer, if it’s context-aware, that can be utilized to determine how best to distribute a request. The name of the operation being invoked, for example, can be used to not only load balance at the service level, but at the operation level. That granularity can be important when operations vary in their consumption of resources. This application layer information, in conjunction with current load and connections on the server provide a wealth of information as to how best, i.e. most efficiently, to distribute any given request. But if the folks in charge of configuring the load balancer aren’t aware of the impact of algorithms on the application and its infrastructure, you can end up in a situation much like that described in Stefan’s blog on the subject. CLOUD WILL MAKE THIS SITUATION WORSE Cloud computing won’t solve this problem and, in fact, it will probably make it worse. The belief that the infrastructure should be “hidden” from the user (that’s you) means that configuration options – like the load balancing algorithm – aren’t available to you as a user/deployer of cloud-based applications. Even though load balancing is going to be used to scale your application, you have no clue or control over how that’s going to occur. That’s why it’s important that you ask questions of your provider on this subject. You need to know what algorithm is being used and how requests are distributed so you can determine how that’s going to impact your application and its performance once its deployed. You can’t – or shouldn’t – assume that the load balancing provided is going to magically distribute requests perfectly across your scaled application because it wasn’t configured with your application in mind. If you deploy an application – particularly a SOA or RESTful one – you may find that with scalability comes poor performance or even unavailable applications because of the configuration of that infrastructure you “aren’t supposed to worry about.” Applications are not islands; they aren’t deployed stand-alone even though the virtualization of applications is making it seem like that’s the case. The delivery of applications requires collaboration between a growing number of components in the data center and load balancing is one of the key components that can make or break your application’s performance and availability. Five questions you need to ask about load balancing and the cloud Dr. Dobb’s Journal: Coding in the Cloud Cloud Computing: Vertical Scalability is Still Your Problem Server Virtualization versus Server Virtualization SOA & Web 2.0: The Connection Management Challenge The Impact of the Network on AJAX Have a can of Duh! It’s on me Intro to Load Balancing for Developers – The Algorithms Not All Virtual Servers are Created Equal614Views0likes0Comments