The Cloud Configuration Management Conundrum

Making the case for a stateless infrastructure model.

Cloud computing appears to have hit a plateau with respect to infrastructure services. We simply aren’t seeing providers roll out, even slowly and steadily, the infrastructure services needed to deploy mature enterprise-class applications. An easy answer as to why can be found in the fact that many infrastructure services, while themselves commoditized, are not standardized. That is, while the services are common to just about every data center infrastructure, the configuration, policies, and APIs are not. But this is somewhat analogous to applications, in that the OS and platforms are common but the applications deployed on them are not. And we’ve solved that problem, so why not the problem of deploying unique infrastructure policies and configuration too?

It’s that gosh-darned multi-tenant thing again. It’s not that infrastructure can’t or doesn’t support a multi-tenant model; plenty of infrastructure providers are moving to enable or already support a multi-tenant deployment model. The problem is deeper than that, at the heart of the infrastructure. It’s configuration management, or more precisely, the scale of configuration management necessary in a cloud computing environment.

CONFIGURATION MANAGEMENT

It’s one thing to enable the isolation of network device configurations, whether in hardware or software form factor, as a means to offer infrastructure services in a multi-tenant environment. But each of the services requires a customer-specific and, in many cases, per-application configuration. Add to that the massive number of “objects” that must be managed on the infrastructure for every configuration, and it takes little math skill to realize that the more customers and instances there are, the more configuration “objects” there are. At some point, it becomes too unwieldy and expensive a way for the provider to manage infrastructure services. A shared resource requires a lot of configuration regardless of the multi-tenant model. If you use virtualization to create a multi-tenant environment on a device, then the support is either broad (one instance per customer) or deep (multiple configurations per instance). In either case, you end up with configurations that are almost certainly going to grow in step with the number of customers. Thousands of customers = many thousands of configuration objects. Many infrastructure devices have either imposed or best-practice limits on the number of configuration “objects” they can support.
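The arithmetic behind that growth can be sketched in a few lines of Python. This is only an illustration: the function names, the per-customer counts, and the device limit of 10,000 objects are hypothetical values chosen for the example, not figures from any particular vendor.

```python
def total_config_objects(customers: int, apps_per_customer: int, objects_per_app: int) -> int:
    """Configuration objects grow multiplicatively with customers and applications."""
    return customers * apps_per_customer * objects_per_app


def customers_supported(device_limit: int, apps_per_customer: int, objects_per_app: int) -> int:
    """How many customers fit on one device before its object limit is reached."""
    return device_limit // (apps_per_customer * objects_per_app)


if __name__ == "__main__":
    # Hypothetical numbers: 3 applications per customer, 5 objects per application.
    print(total_config_objects(2000, 3, 5))   # objects needed for 2,000 customers
    print(customers_supported(10_000, 3, 5))  # customers one device could hold
```

Under these assumed numbers, a device runs out of room long before the provider runs out of customers, which is the scaling problem in miniature.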

As a simple, oft-referenced example, consider the VLAN limitation. There is a “theoretical” maximum of 4096 VLANs per trunk. It is theoretical because vendor-imposed limits are generally lower, due to performance and other configuration-related constraints. The 4096, like most upper limits in networking, comes from a field or primitive numerical type composed of a fixed number of bits and/or bytes, which imposes a mathematical upper bound (here, the 12-bit VLAN identifier defined by IEEE 802.1Q, which allows 2^12 = 4096 values). These bounds are not easily modified without changes to the core software/firmware. When they relate to the transmission of data, these limits are also constrained by specifications, such as those maintained by the IEEE, that define the format of information exchanged via the network. Obviously, changing the size of any one field changes the layout of the data (because fields are strictly defined in terms of the bit/byte locations in which specific data will be found) and thus requires every vendor to change their solutions to be able to process the new format. This is at the heart of the problem with IPv4-to-IPv6 migration: the size of IP packet headers changes, as does the location of data critical to the processing of packets. The same is true of configuration stores, particularly on networking devices where size and performance are critical. The smallest data type possible for the platform is almost always used to ensure performance is maintained. Thus, configuration and processing are highly dependent on tightly constrained specifications and highly sensitive to speed, making changes to these protocols and configuration systems fraught with risk.
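To make the bit-level constraint concrete, here is a minimal Python sketch that packs the 16-bit tag control field of an IEEE 802.1Q tag: 3 bits of priority, 1 drop-eligible bit, and a 12-bit VLAN ID. The function name `pack_tci` is made up for this illustration. Note that in 802.1Q the values 0 and 4095 are reserved, so the usable count is slightly lower than 4096.

```python
import struct

VID_BITS = 12                      # width of the VLAN ID field in 802.1Q
MAX_VIDS = 1 << VID_BITS           # 2**12 = 4096 possible values
USABLE_VIDS = MAX_VIDS - 2         # 0 and 4095 are reserved by the standard


def pack_tci(pcp: int, dei: int, vid: int) -> bytes:
    """Pack priority (3 bits), drop-eligible (1 bit), and VLAN ID (12 bits) into 2 bytes."""
    if not 0 <= pcp < 8:
        raise ValueError("priority must fit in 3 bits")
    if dei not in (0, 1):
        raise ValueError("drop-eligible indicator is a single bit")
    if not 0 <= vid < MAX_VIDS:
        raise ValueError(f"VLAN ID {vid} does not fit in {VID_BITS} bits")
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!H", tci)  # network byte order, unsigned 16-bit


if __name__ == "__main__":
    print(pack_tci(0, 0, 100).hex())
    try:
        pack_tci(0, 0, 4096)       # one more bit than the field can hold
    except ValueError as err:
        print(err)
```

There is no way to express a 4097th VLAN without widening the field, and widening the field changes the position of everything after it in the frame.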

This is before we even begin considering the run-time impact in terms of performance and potential disruption due to inherent volatility in the cloud “back plane”, a.k.a. server/instance infrastructure. Consider the way in which iptables is used in some cloud computing environments. The iptables configuration is dynamically updated based on customer configuration. As the configuration grows, it takes longer for the system to match traffic against the “rules” and apply them, and performance degrades. It’s also an inefficient means of managing a multi-tenant environment: the process of dynamically inserting and removing rules is itself a strain on the entire system, because software-only solutions rarely differentiate between management and data plane activity.
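The cost of sequential rule matching can be illustrated with a toy model in Python. This is a deliberate simplification and not how iptables is implemented internally; it assumes one hypothetical /32 allow rule per tenant and a first-match scan through the list, and simply counts how many rules are examined.

```python
import ipaddress


def build_rules(tenants: int):
    """One hypothetical allow rule per tenant: 10.0.0.1/32, 10.0.0.2/32, ..."""
    base = int(ipaddress.IPv4Address("10.0.0.0"))
    return [
        (ipaddress.ip_network(f"{ipaddress.IPv4Address(base + i + 1)}/32"), "ACCEPT")
        for i in range(tenants)
    ]


def match(rules, src: str):
    """First-match scan; returns (action, number of rules examined)."""
    addr = ipaddress.ip_address(src)
    for examined, (network, action) in enumerate(rules, start=1):
        if addr in network:
            return action, examined
    return "DROP", len(rules)  # default policy after every rule was checked


if __name__ == "__main__":
    for tenants in (10, 1_000, 10_000):
        rules = build_rules(tenants)
        newest_tenant = str(rules[-1][0].network_address)
        print(tenants, match(rules, newest_tenant))
```

In this model the most recently added tenant, and any traffic that matches nothing, pays for a scan of the entire list, so per-packet work rises in direct proportion to the number of tenants.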

So basically what we have is a configuration management problem that may be stifling movement forward on more infrastructure services being offered as, well, services. The load from management of the massive configurations necessary would almost certainly overwhelm the infrastructure.

So what’s the solution?

OPTIONS

The most obvious and easiest option from a provider standpoint is architectural multi-tenancy, in the form of virtual network appliances. It decreases the provider’s configuration management burden and places the onus on the customer. This option is, however, more complex because of the topological constraints imposed by most infrastructure, namely a dependency on a stable IP address space, and thus becomes a management burden on the customer, making it less appealing from their perspective to adopt. A viable solution, to be sure, but one that imposes management burdens that may negatively impact operational and financial benefits.

A second solution is potentially found in OpenFlow. Using its programmable foundation to direct flows based on some piece of data other than IP address could alleviate the management burden imposed by topological constraints and allow architectural multi-tenancy to become the norm in cloud computing environments. This solution may, however, require some new network or application layer protocol, or an extension of an existing protocol, to allow OpenFlow to identify, well, the flow and direct it appropriately. This model does not necessarily address the issue of configuration management. If one is merely using OpenFlow to direct traffic to a controller of some sort, one assumes the controller would need per-flow, per-application, or per-customer configuration. Thus such a model is likely to run into the same issues with the scale of configuration management unless such extensions or new protocols are self-describing. Self-describing meta-data included in the protocol would allow interpretation on the OpenFlow-enabled component rather than a lookup of configuration data.

Basically, any extension of a protocol or new protocol used to dynamically direct flows to the appropriate services would need to be as stateless as possible, i.e. no static configuration required on the infrastructure components beyond service configuration. A model based on this concept does not necessarily require OpenFlow, although it certainly appears that a framework such as that offered by the nascent specification is well-suited to enabling and supporting such an architecture with the least amount of disruption. However, as with other suggested specification-based solutions, no such commoditized or standardized meta-data model exists.
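Since no standard meta-data model exists, the following Python sketch is purely hypothetical. It assumes each flow carries a small self-describing `meta` structure naming the service it needs and that service’s parameters; the field names, the service names, and the `handle` function are all invented for illustration and have nothing to do with the actual OpenFlow specification. The point is only that the component holds service configuration (which services it offers) and no per-customer table.

```python
from typing import Callable, Dict

# Service configuration only: the services this component knows how to apply.
SERVICES: Dict[str, Callable[[dict, dict], str]] = {}


def service(name: str):
    """Register a service handler under a name (hypothetical registry)."""
    def register(fn):
        SERVICES[name] = fn
        return fn
    return register


@service("rate_limit")
def rate_limit(params: dict, packet: dict) -> str:
    return f"rate-limit {packet['src']} to {params['rps']} rps"


@service("load_balance")
def load_balance(params: dict, packet: dict) -> str:
    pool = params["pool"]
    return f"forward to {pool[packet['flow_id'] % len(pool)]}"


def handle(packet: dict) -> str:
    """Interpret the meta-data carried with the flow; no per-customer lookup."""
    meta = packet["meta"]
    return SERVICES[meta["service"]](meta["params"], packet)


if __name__ == "__main__":
    packet = {
        "src": "10.0.0.5",
        "flow_id": 7,
        "meta": {"service": "load_balance", "params": {"pool": ["a", "b", "c"]}},
    }
    print(handle(packet))
```

Adding a customer in this model adds nothing to the component’s configuration; only adding a new kind of service does. A real design would also have to authenticate and validate the meta-data, which this sketch ignores.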

The “control plane” of a data center architecture, whether in cloud computing or enterprise-class environments, will need a more scalable configuration management approach if it is to become service-oriented whilst simultaneously supporting the dynamism inherent in emerging data center models. We know that stateless architectures are the pièce de résistance of a truly scalable application architecture. RESTful application models, while not entirely stateless yet, are modeled on that premise and do their best to replicate a non-dependence on state as a means to achieve highly scalable applications. The same model and premise should also be applied to infrastructure to achieve the scale of configuration management necessary to support Infrastructure as a Service, in the pure sense of the word. OpenFlow appears to have the potential to support a stateless infrastructure, if a common model can be devised upon which an OpenFlow-based solution could act independently.

A stateless infrastructure, freed of its dependency on traditional configuration models, would ultimately scale as required by a massively multi-tenant environment and could lay the foundation for true portability of ~~applications~~ architectures across cloud and data center environments.