The Number of the Counting Shall be Three (Rules of Thumb for Application Availability)

Three shall be the number thou shalt count, and the number of the counting shall be three. If you’re concerned about maintaining application availability, then these three rules of thumb shall be the number of the counting. Any less and you’re asking for trouble.

I like to glue animals to rocks and put disturbing amounts of electricity and saltwater NEXT TO EACH OTHER

Last week I was checking out my saltwater reef when I noticed water lapping at the upper edges of the tank. Yeah, it was about to overflow. Somewhere in the system something had failed. Not entirely, but enough to cause the flow in the external sump system to slow to a crawl and the water levels in the tank to slowly rise.

Troubleshooting that was nearly as painful as troubleshooting the cause of application downtime. As with a data center, there are ingress ports and egress ports and inline devices (protein skimmers) that have their own flow rates (bandwidth) and gallons per hour processing capabilities (capacity) and filtering (security). When any one of these pieces of the system fails to perform optimally, well, the entire system becomes unreliable, unstable, and scary as hell. Imagine a hundred or so gallons of saltwater (and all the animals inside) floating around on the floor. Near electricity.

The challenges to maintaining availability in a marine reef system are similar to those in an application architecture. There are three areas you really need to focus on, and you must focus on all three because failing to address any one of them can cause an imbalance that may very well lead to an epic fail.

RELIABILITY

Reliability is the cornerstone of assuring application availability. If the underlying infrastructure – the hardware and software – fails, the application is down. Period. Any single point of failure in the delivery chain – from end-to-end – can cause availability issues. The trick to maintaining availability, then, is redundancy. It is this facet of availability where virtualization most often comes into play, at least from the application platform / host perspective. You need at least two instances of an application, just in case. Now, one might think that as long as you have the capability to magically create a secondary instance, and redirect application traffic to it if the primary application host fails, you’re fine. You’re not. Creation, boot, load time…all impact downtime and in some cases, every second counts.

The same is true of infrastructure. It may seem that as long as you could create, power up, and redirect traffic to a virtual instance of a network component that availability would be sustained, but the same timing issues that plague applications will plague the network, as well. There really is no substitute for redundancy as a means to ensure the reliability necessary to maintain application availability. Unless you find prescient, psychic components (or operators) capable of predicting an outage at least 5-10 minutes before it happens. Then you’ve got it made.

Several components are often overlooked when it comes to redundancy and reliability. In particular, internet connectivity is often ignored as a potential point of failure or, more often the case, it is viewed as one of those “things beyond our control” in the data center that might cause an outage. Multiple internet connections are expensive, understood. That’s why leveraging a solution like link load balancing makes sense. If you’ve got multiple connections, why not use them all and use them intelligently – to assist in efforts to maintain/improve application performance or prioritize application traffic in and out of the data center? Doing so allows you to assure availability in the event that one connection fails, yet no connection sits idle when things are all hunky-dory in the data center.
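As a rough illustration of the link load balancing idea, here is a minimal Python sketch that splits traffic across whichever connections are currently up, in proportion to their bandwidth. The link names, the bandwidth figures, and the `traffic_shares` function are all made up for the example; a real link load balancer weighs far more than raw bandwidth.

```python
def traffic_shares(links):
    """Split traffic across the links that are up, weighted by bandwidth.

    `links` maps a (hypothetical) link name to (bandwidth_mbps, is_up).
    Returns a dict of link name -> fraction of traffic it should carry.
    """
    up = {name: bw for name, (bw, is_up) in links.items() if is_up}
    total = sum(up.values())
    if total == 0:
        raise RuntimeError("no internet connection is up")
    return {name: bw / total for name, bw in up.items()}


# Two illustrative connections; names and sizes are assumptions.
links = {"isp-a": (100.0, True), "isp-b": (50.0, True)}
shares_normal = traffic_shares(links)

# Simulate the first connection failing: the second one takes everything.
links["isp-a"] = (100.0, False)
shares_failed = traffic_shares(links)
```

When both links are up, each carries traffic in proportion to its size, so neither sits idle; when one fails, the survivor simply takes the full load.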

The rule of thumb for reliability is this: Like Sith lords, there should always be two of everything with automatic failover to the secondary if the primary fails (or is cut down by a Jedi knight).
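To make the rule concrete, here is a minimal sketch of automatic failover between a primary and a secondary that is already running. The `Instance` class and `pick_active` function are hypothetical names for illustration; in practice the health flag would be set by whatever health checks your load balancer or application delivery controller performs.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool = True  # in practice, updated by a health monitor


def pick_active(primary, secondary):
    """Return the instance that should receive traffic right now."""
    if primary.healthy:
        return primary
    if secondary.healthy:
        return secondary
    raise RuntimeError("both primary and secondary are down")


# Hypothetical instance names.
primary = Instance("app-1")
secondary = Instance("app-2")

active_before = pick_active(primary, secondary).name

primary.healthy = False  # the Jedi knight strikes
active_after = pick_active(primary, secondary).name
```

The point of the sketch is that the secondary already exists when the primary fails, so the only cost of failover is the decision itself, not creation, boot, and load time.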

CAPACITY

The most common cause of downtime is probably a lack of capacity. Whether it’s due to a spike in usage (legitimate or not) or simply unanticipated growth over time, a lack of compute resources available across the application infrastructure tiers is usually the cause of unexpected downtime. This is certainly one of the drivers for cloud computing and rapid provisioning models – external and internal – as it addresses the immediacy of need for capacity upon availability failures. This is particularly true in cases where you actually have the capacity – it just happens to reside physically on another host. Virtualization and cloud computing models allow you to co-opt that idle capacity and give it to the applications that need it, on-demand. That’s the theory, anyway. Reality is that there are also timing issues around provisioning that must be addressed but these are far less complicated and require fewer psychic powers than predicting total failure of a component. Capacity planning is as much art as science, but it is primarily based on real numbers that can be used to indicate when an application is nearing capacity. Because of this predictive power of monitoring and data, provisioning of additional capacity can be achieved before it’s actually needed.

Even without automated systems for provisioning, this method of addressing capacity can be leveraged – the equations for when provisioning needs to begin simply change based on the amount of time needed to manually provision the resources and integrate them with the scalability solution (i.e. the load balancer, the application delivery controller).
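One way to picture those equations is the back-of-the-napkin sketch below. It assumes load grows roughly linearly, which is a simplification, and every name and number in it (`provisioning_trigger`, the request rates, the provisioning times, the five-minute safety margin) is illustrative rather than taken from any real system.

```python
def provisioning_trigger(capacity_limit, growth_per_minute,
                         provision_minutes, safety_minutes=5.0):
    """Load level at which provisioning must begin.

    Assumes load grows linearly at `growth_per_minute`. Provisioning has
    to start early enough that new capacity is online `safety_minutes`
    before the projected load reaches `capacity_limit`.
    """
    lead_time = provision_minutes + safety_minutes
    return max(0.0, capacity_limit - growth_per_minute * lead_time)


def should_provision(current_load, trigger):
    """True once the observed load has reached the trigger level."""
    return current_load >= trigger


# Illustrative numbers: 1000 requests/sec of capacity, with load
# growing by 10 requests/sec every minute.
manual_trigger = provisioning_trigger(1000.0, 10.0, provision_minutes=15.0)
automated_trigger = provisioning_trigger(1000.0, 10.0, provision_minutes=2.0)
```

With these made-up numbers, a 15-minute manual process has to start at 800 requests/sec, while a 2-minute automated one can wait until 930. The slower the provisioning, the earlier the trigger.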

The rule of thumb for capacity is this: Like interviews and special events, unless you’re five minutes early provisioning capacity you’re late.

SECURITY

Security – or lack thereof – is likely the most overlooked root cause of availability issues, especially in today’s hyper-connected environments. Denial of service attacks are just that, an attempt to deny service to legitimate users, and they are getting much harder to detect because they’ve been slowly working their way up the stack. Layer 7 DDoS attacks are particularly difficult to ferret out as they don’t necessarily have to even be “fast”, they just have to chew up resources.

Consider the latest twist on the SlowLoris attack: the attack takes the form of legitimate POST requests that s-l-o-w-l-y feed data to the server, in a way that consumes resources but doesn’t necessarily set off any alarm bells because it’s a completely legitimate request. You don’t even need a lot of them, just enough to consume all the resources on web/application servers such that no one else can utilize them. Leveraging a full proxy intermediary should go quite a ways to mitigate this situation because the request is being fed to the intermediary, not the web/application servers, and the intermediary generally has more resources and is already well versed in dealing with very slow clients. Resources are not consumed on the actual servers and it would take a lot (generally hundreds of thousands to millions) of such requests to consume the resources on the intermediary. Such an attack works precisely because the miscreants aren’t using many connections, so to take out a site front-ended by such an intermediary they would likely need enough connections to trigger an alert/notification.

Disclaimer: I have not tested such a potential solution so YMMV. In theory, based on how the attack works, the natural offload capabilities of ADCs should help mitigate this attack.
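For the curious, the kind of check an intermediary could apply to slow POST requests might look something like the sketch below: give each request a grace period, then flag it if its average upload rate falls below a floor. This is not how any particular ADC works; the `is_too_slow` function and both thresholds are assumptions picked purely for illustration, and like the idea above it is untested against a real attack.

```python
def is_too_slow(bytes_received, elapsed_seconds,
                min_bytes_per_second=100.0, grace_seconds=10.0):
    """Flag a request body that is arriving suspiciously slowly.

    Both thresholds are illustrative assumptions. A request is never
    flagged during the grace period, so ordinary slow starts pass.
    """
    if elapsed_seconds <= grace_seconds:
        return False
    return bytes_received / elapsed_seconds < min_bytes_per_second


# A client that has trickled in 300 bytes over a full minute.
trickle = is_too_slow(300, 60.0)

# A client that uploaded about 1 MB in the same minute.
normal = is_too_slow(1_000_000, 60.0)

# A brand-new connection still inside the grace period.
just_started = is_too_slow(10, 5.0)
```

A real deployment would need to tune the floor carefully, since legitimate users on poor mobile connections can look a lot like an attacker.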

But I digress. The point is that security is one of the most important facets of maintaining availability. It isn’t just about denial of service attacks, either, or even consuming resources. A well-targeted injection attack or defacement can cripple the database or compromise the web/application behavior such that the application no longer behaves as expected. It may respond to requests, but what it responds with is just as vital to “availability” as responding at all. As such, ensuring the integrity of application data and applications themselves is paramount to preserving application availability.

The rule of thumb for security is this: If you build your security house out of sticks, a big bad wolf will eventually blow it down.

Assuring application availability is a much more complex task than just making sure the application is running. It’s about ensuring enough capacity exists at the right time to scale on demand; it’s about ensuring that if any single component fails, another is in place to take over; and it’s absolutely about ensuring that a lackluster security policy doesn’t result in a compromise that leads to failure. These three components are critical to the success of availability initiatives and failing to address any one of them can cause the entire system to fail.