Load Aware Fabrics
#cloud Heterogeneous infrastructure fabrics are appealing but watch out for the gotchas
One of the "rules" of application delivery (and infrastructure in general) has been that when scaling out such technologies, all components must be equal. That started with basic redundancy (deploying two of everything to avoid a single point of failure in the data path) and has remained true until recently.
Today, fabrics can be composed of heterogeneous components. Beefy physical hardware can be easily paired with virtualized or cloud-hosted components. This is good news for organizations that want to scale out infrastructure periodically without overprovisioning for the rest of the year and leaving resources idle.
Except when it's not so good: when something goes wrong and there's suddenly not enough capacity to handle the load because of the disparity in component capacity.
We (as in the industry) used to never, ever, ever suggest running active-active infrastructure components when load on each component was greater than 50%. The math easily shows why:
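To make that math concrete, here's a quick sketch with illustrative numbers (not from any particular deployment): two equal components running active-active at 60% each are carrying 120% of a single component's capacity between them, so losing one leaves the survivor with more than it can handle.

    # Illustrative numbers only: two equal components running active-active at 60% each.
    capacity = 100          # units of load one component can handle
    utilization = 0.60      # 60% load on each of the two components

    total_load = 2 * capacity * utilization          # 120 units across the pair
    survivor_utilization = total_load / capacity     # what the remaining component must carry
    print(f"Survivor must run at {survivor_utilization:.0%}")   # 120% -- it can't
    # Keep each component at or below 50% and the survivor tops out at 100%.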
It's important to note that this isn't just a disaster (failure) scenario; the same math applies to maintenance, upgrades, and so on. This is why emerging fabric-based models should be active-active-N. That "N" is critically important as a reserve of resources designed to ensure the "not so good" scenario is covered.
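One way to see why the extra members matter, again with made-up numbers and assuming equal components: the more members sharing the load, the hotter each one can safely run and still absorb a single failure.

    # Illustrative only: with n equal members, a single failure spreads the lost member's
    # load across the n - 1 survivors, so each member must be kept under (n - 1) / n.
    def max_safe_utilization(members: int) -> float:
        return (members - 1) / members

    for n in (2, 3, 4):
        print(f"{n} members: keep each under {max_safe_utilization(n):.0%}")
    # 2 members: 50%, 3 members: 67%, 4 members: 75% -- more members, more usable headroom

Of course, that tidy formula assumes every member is the same size; in a heterogeneous fabric the sizing has to account for each component's actual capacity, which is exactly where the trouble starts.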
This fundamental axiom of architecting reliable anything - always match capacity with demand - is the basis for understanding the importance of load-aware failover and distribution in fabric-based architectures.
In most HA (high availability) scenarios, the network architect carefully determines the order of precedence and failover. These are pre-determined: there's a primary and a secondary (and a tertiary, and so on). That's it. It doesn't matter if the secondary is already near or at capacity, or that it's a virtualized element with limited capacity instead of a more capable piece of hardware. It is what it is.
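A minimal sketch of that static model (generic logic with made-up member names and numbers, not any particular vendor's failover implementation): the failover target is chosen purely by configured precedence, and current load never enters into the decision.

    # Static, precedence-only failover: pick the next healthy member in configured order.
    members = [
        {"name": "primary",   "healthy": False, "capacity": 100, "load": 0},
        {"name": "secondary", "healthy": True,  "capacity": 40,  "load": 35},  # nearly full
        {"name": "tertiary",  "healthy": True,  "capacity": 100, "load": 10},
    ]

    def pick_failover_target(members):
        for m in members:                  # walk the configured order of precedence
            if m["healthy"]:
                return m                   # first healthy member wins, loaded or not
        return None

    print(pick_failover_target(members)["name"])  # "secondary", despite having almost no headroom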
And that "is" could be disastrous to availability. If that "secondary" isn't able to handle the load, users are going to be very angry, because either responsiveness will plummet to the point the app might as well be unavailable, or the app will be completely unavailable. In either case, it's not meeting whatever SLA has been brokered between IT and the business owner of that application.
That's why it's vitally important, as we move toward fabric-based architectures, that failover and redundancy get more intelligent and that the algorithms used to distribute traffic across the fabric get very, very intelligent. Both must become load aware, able to dynamically determine what to do in the event of a failure. The fabric itself ought to be aware not just of how much capacity each individual component can handle but of how much it is currently handling, so that if a failure occurs or performance is degrading, it can determine dynamically which component (or components, if need be) can take over more load. In the future, that intelligence might also enable the fabric to spin up more resources when it recognizes there simply aren't enough.
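A load-aware version of the same decision might look something like this. It's a sketch under simple assumptions (made-up member names and numbers, a single headroom figure standing in for whatever the fabric actually measures), not a real fabric implementation: instead of walking a fixed precedence list, the fabric redistributes the failed member's load to whichever healthy members have headroom, spilling across more than one when a single component can't absorb it all.

    # Load-aware failover: redistribute the failed member's load based on current headroom.
    def redistribute(failed_load, members):
        """Spread failed_load across healthy members, most headroom first.
        Returns a plan of (name, extra_load) or raises if the fabric can't absorb it."""
        healthy = [m for m in members if m["healthy"]]
        healthy.sort(key=lambda m: m["capacity"] - m["load"], reverse=True)
        plan = []
        for m in healthy:
            if failed_load <= 0:
                break
            headroom = m["capacity"] - m["load"]
            take = min(headroom, failed_load)
            if take > 0:
                plan.append((m["name"], take))
                failed_load -= take
        if failed_load > 0:
            raise RuntimeError("Not enough capacity in the fabric; time to spin up more resources")
        return plan

    members = [
        {"name": "secondary", "healthy": True, "capacity": 40,  "load": 35},
        {"name": "tertiary",  "healthy": True, "capacity": 100, "load": 10},
    ]
    print(redistribute(95, members))  # [('tertiary', 90), ('secondary', 5)] -- spread across both

The point of the exception at the end is the same one made above: when the fabric knows it can't cover the load, that's the trigger to add resources rather than quietly degrade.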
As we continue to architect "smarter" networks, we need to re-evaluate existing technology and figure out how it needs to evolve, too, to fit into the new, more dynamic and efficiency-driven world.
It's probably true that failover technologies and load balancing algorithms aren't particularly exciting to most people, but they're a necessary and critical function of networks and infrastructure designed to ensure high availability in the event of (what many would call inevitable) failure. So as network and application service technologies evolve and transform, we've got to consider how to adapt foundational technologies like failover models to ensure we don't lose the stability necessary to continue evolving the network.