The Impossibility of CAP and Cloud

It comes down to this: the on-demand provisioning and elastic scalability systems that make up “cloud” are addressing NP-Complete problems for which there is no known exact solutions.

At the heart of what cloud computing provides – in addition to compute-on-demand – is the concept of elastic scalability. It is through the ability to rapidly provision resources and applications that we can achieve elastic scalability and, one assumes, through that high availability of systems. Obviously, given my relationship to F5 I am strongly interested in availability. It is, after all, at the heart of what an application delivery controller is designed to provide. So when a theorem is presented that basically says you cannot build a system that is Consistent, Available, and Partition-Tolerant I get a bit twitchy.

Just about the same time that Rich Miller was reminding me of Brewer’s CAP Theorem someone from HP Labs claimed to have solved the P ≠ NP problem (shortly thereafter determined to not be a solution after all), which got me thinking about NP-Completeness in problem sets, of which solving the problem of creating a distributed CAP-compliant system certainly appears to be a member.

CLOUD RESOURCE PROVISIONING is NP-COMPLETE

A core conflict with cloud and CAP-compliance is on-demand provisioning. There are, after all, a minimal set of resources available (cloud is not infinitely scalable, after all) with, one assumes, each resource having a variable amount of compute availability. For example, most cloud providers use a “large”, “medium”, and “small” sizing approach to “instances” (which are, in almost all cases, a virtual machine). Each “size” has a defined set of reserved compute (RAM and CPU) for use. Customers of cloud providers provision instances by size.

At first glance this should not a problem. The provisioning system is given an instruction, i.e. “provision instance type X.” The problem begins when you consider what happens next – the provisioning system must find a hardware resource with enough capacity available on which to launch the instance.

In theory this certainly appears to be a variation of the Bin packing problem (which is NP-complete). It is (one hopes) resolved by the cloud provider by removing the variability of location (parameterization) or the use of approximation (using the greedy approximation algorithm “first-fit”, for example). In a pure on-demand provisioning environment, the management system would search out, in real-time, a physical server with enough physical resources available to support the requested instance requirements but it would also try to do so in a way that minimizes the utilization of physical resources on each machine so as to better guarantee availability of future requests and to be more efficient (and thus cost-effective).

Brewer’s CAP Theorem

It is impractical, of course, to query each physical server in real-time to determine an appropriate location, so no doubt there is a centralized “inventory” of resources available that is updated upon the successful provisioning of an instance. Note that this does not avoid the problem of NP-Completeness and the resulting lack of a solution as data replication/synchronization is also an NP-Complete problem. Now, because variability in size and an inefficient provisioning algorithm could result in a fruitless search, providers might (probably do) partition each machine based on the instance sizes available and the capacity of the machine. You’ll note that most providers size instances as multiples of the smallest, if you were looking for anecdotal evidence of this. If a large instance is 16GB RAM and 4 CPUs, then a physical server with 32 GB of RAM and 8 CPUs can support exactly two large instances. If a small instance is 4GB RAM and 1 CPU, that same server could ostensibly support a combination of both: 8 small instances or 4 small instances and 2 large instances, etc… However, that would make it difficult to keep track of the availability of resources based on instance size and would eventually result in a failure of capacity availability (which makes the system non-CAP compliant). However, not restricting the instances that can be deployed on a physical server returns us to a bin packing-like algorithm that is NP-complete which necessarily introduces unknown latency that could impact availability. This method also introduces the possibility that while searching for an appropriate location some other consumer has requested an instance that is provisioned on a server that could have supported the first consumer’s request, which results in a failure to achieve CAP-compliance by violating the consistency constraint (and likely the availability constraint, as well).

The provisioning will never be “perfect” because there is no exact solution to an NP-complete problem. That means the solution is basically the fastest/best it can be given the constraints. Which we often distill down to “good enough.” That means that there are cases where either availability or consistency will be violated, making cloud in general non-CAP compliant.

The core conflict is the definition of “highly available” as “working with minimal latency.” Or perhaps the real issue is the definition of “minimal”. For it is certainly the case that a management system that leverages opportunistic locking and shared data systems could alleviate the problem of consistency, but never availability. Eliminating the consistency problem by ensuring that every request has exclusive access to the “database” of instances when searching for an appropriate physical location introduces latency while others wait. This is the “good enough” solution used by CPU schedulers – the CPU scheduler is the one and only authority for CPU time-slice management. It works more than well-enough on a per-machine basis, but this is not scalable and in larger systems would result in essentially higher rates of non-availability as the number of requests grows.

WHY SHOULD YOU CARE

Resource provisioning and job scheduling in general are in the class of NP-complete problems. While the decision problem to choose an appropriate physical server on which to launch a set of requested instances can be considered an instantiation of the Bin packing problem, it can also be viewed as a generalized assignment problem or, depending on the parameters, a variation of the Knapsack problem, or any one of the multiprocessor scheduling problems, all of which are NP-complete. Cloud is essentially the integration of systems that provide resource provisioning and may include job scheduling as a means to automate provisioning and enable a self-service environment. Because of its reliance on problems that are NP-complete we can deduce that cloud is NP-complete.

NOTE: No, I’m not going to provide a formal proof. I will leave that to someone with a better handle on the reductions necessary to prove (or disprove) that the algorithms driving cloud are either the same or derivations of existing NP-Complete problem sets.

The question “why should I care if these problems are NP-Complete” is asked by just about every student in every algorithms class in every university there is. The answer is always the same: because if you can recognize that a problem you are trying to solve is NP-Complete you will not waste your time trying to solve a problem that thousands of mathematicians and computer scientists have been trying to solve for 50 years and have thus far not been able to do so. And if you do solve it, you might want to consider formalizing it, because you’ve just proved P = NP and there’s a $1,000,000 bounty out on that proof. But generally speaking, it’s a good idea to recognize them when you see them because you can avoid a lot of frustration by accepting up front you can’t solve it, and you can also leverage existing research / algorithms that have been proposed as alternatives (approximation algorithms, heuristics, parameterized algorithms, etc…) to get the “best possible” answer and get on with more important things.

It also means there is no one optimal solution to “cloud”, only a variety of “good enough” or “approximately optimal” solutions. Neither the time required to provision can be consistently guaranteed or the availability of resources in a public cloud environment. This is, essentially, why the concept of reserved instances exists. Because if your priorities include high availability, you’d better consider budgeting for reserved instances, which is basically a more cost effective method of having a whole bunch of physical servers in your pool of available resources on stand-by.

But if your priorities are geared toward pinching of pennies, and availability is lower on your “must have” list of requirements, then reserving instances is an unnecessary cost – as long as you’re willing to accept the possibility of lower availability.

Basically, the impossibility of achieving CAP in cloud impacts (or should impact) your cloud computing strategy – whether you’re implementing locally or leveraging public resources. As I mentioned very recently – cloud is computer science, and if you understand the underlying foundations of the systems driving cloud you will be much better able to make strategic decisions regarding when and what type of cloud is appropriate and for what applications.

Published Sep 01, 2010

Version 1.0