WILS: Virtualization, Clustering, and Disaster Recovery

#virtualization Clustering is local. Disaster recovery is global.

There are two levels of reliability for an application. There’s local and there’s global. We might want to consider it more simply as “inside” and “outside” reliability.

Virtualization enables local reliability – the inside kind of reliability. Whether you’re relying upon clustering or load balancing (each has advantages and disadvantages, but for purposes of reliability and this discussion we’ll assume equal capabilities) to provide the abstraction isn’t as important as recognizing that in terms of reliability you’re acting at the local, i.e. inside, level.

A cluster or pool, in load balancing parlance, is able to maintain local reliability by distributing load across multiple instances of the application. We can transparently add or remove instances to achieve the elasticity necessary to meet demand, thus ensuring reliability. In the event of a local disaster, such as the failure of a virtual machine, we can take the failed instance out of the rotation and even provision another to replace it.
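To make the "take it out of the rotation" idea concrete, here is a minimal sketch of a local pool. The `Pool` class, its method names, and the `vm-*` instance names are all hypothetical, invented for illustration; a real load balancer or cluster manager does far more, but the core behavior is the same: only healthy instances receive traffic.

```python
class Pool:
    """Hypothetical local pool: round-robin across healthy instances only."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._next = 0  # position of the next round-robin candidate

    def mark_down(self, instance):
        # Take a failed instance out of the rotation.
        self.healthy.discard(instance)

    def add(self, instance):
        # Provision a replacement (or return a recovered instance to service).
        if instance not in self.instances:
            self.instances.append(instance)
        self.healthy.add(instance)

    def pick(self):
        # Walk the list at most once, skipping anything marked down.
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next % len(self.instances)]
            self._next += 1
            if candidate in self.healthy:
                return candidate
        return None  # nothing healthy: the pool cannot help from the inside


pool = Pool(["vm-a", "vm-b", "vm-c"])
pool.mark_down("vm-b")                 # local failure: one VM dies
picks = [pool.pick() for _ in range(4)]
```

Note what happens when every instance is down: `pick()` returns `None`. The pool has no way to send the user somewhere else, which is exactly the limit of inside reliability.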

What clustering (load balancing) can’t do is address global reliability, i.e. outside reliability. Global reliability must be addressed using a different technology, normally referred to as Global Server Load Balancing (GSLB). The terminology grew out of the days when global reliability was achieved by load balancing individual servers across the globe to ensure a failure in the network or at a specific location could not interrupt the service. As demand grew, GSLB performed the same functions, but did so at a site level, essentially load balancing sites instead of individual servers. The name remains, however confusing that may be to the uninitiated.

To achieve global reliability you need GSLB. To avoid the detrimental effects of a disaster in the network or at the site level, you must be able to direct users to an active location. This is realized in most implementations through simple DNS load balancing techniques; i.e. when a user makes a request, the GSLB service responds with the IP address of an appropriate, active site. GSLB is capable of much more complex decision making, however, and decisions can be based on a variety of business and operational parameters, at the discretion of the organization. The GSLB service monitors each of the local sites, and is able to detect an outage within seconds and begin directing users elsewhere.
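The simple DNS case can be sketched in a few lines. Everything here is an assumption made for illustration: the `Site` record, the `resolve` function, the `priority` field standing in for whatever business or operational parameters an organization chooses, and the addresses, which come from the ranges reserved for documentation. A real GSLB service also runs the health monitoring itself; this sketch just takes the result as a flag.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical record for one location behind the GSLB service."""
    name: str
    ip: str
    priority: int        # lower wins; a stand-in for real business rules
    healthy: bool = True  # assumed to be set by an external site monitor


def resolve(sites):
    """Return the IP of the preferred active site, or None if all are down."""
    active = [s for s in sites if s.healthy]
    if not active:
        return None
    return min(active, key=lambda s: s.priority).ip


primary = Site("datacenter", "192.0.2.10", priority=1)
secondary = Site("cloud-dr", "198.51.100.10", priority=2)
sites = [primary, secondary]

before = resolve(sites)    # primary is up, so users go there
primary.healthy = False    # the monitor detects a site-level outage
after = resolve(sites)     # users are now directed to the secondary site
```

The decision is made outside any single site, so it still works when the primary site is gone entirely. It also only works if the secondary site is already there to answer.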

At the local level, clustering and load balancing also monitor the “health” of individual instances and can react similarly in the event of a failure, but do so only at the local level. If the site fails, as might be the case in the event of a disaster, the local service is unable to do anything about it. It can’t redirect globally, it can’t notify other components. It’s just gone.

For disaster recovery purposes, this is important stuff. When cloud first drifted onto the scene it was postulated that the cheaper compute would make implementing secondary data centers specifically for disaster recovery purposes more financially feasible for a wider variety of organizations. While that’s true in the sense that it’s way cheaper than building a secondary data center, many of the technological foundations remain the same: GSLB and a replicated environment.

Some folks balk at the replication and point to transparent migration as a solution. After all, why pay even pennies on the hour for instances that may never be put into commission? The problem is that transparent migration of virtual machines is only useful while the VMs are live and running. If they aren’t, such as might be the case in the event of a disaster, the site can’t be replicated and global reliability fails. A cluster-to-cluster failover via a bridged network to the cloud might sound like a good idea, but it isn’t practical when applied to a disaster recovery scenario. Too much depends on the availability of the site, of the network, and of the clustering/load balancing mechanism itself. If any one of the components has failed, global reliability is unrealizable.

To achieve true global reliability, regardless of the involvement of cloud computing, you’re going to need to implement a good old-fashioned GSLB architecture, complete with the network components and replicated application infrastructure. Local reliability (inside) may be achievable with virtual clustering solutions, but global reliability requires a very different architecture and set of technologies.

Disaster recovery strategies cannot rely on local reliability; they must be based on global reliability.

WILS: Write It Like Seth. Seth Godin always gets his point across with brevity and wit. WILS is an attempt to be concise about application delivery topics and just get straight to the point. No dilly dallying around.