Forum Discussion

Florin_Andrei_1
Nimbostratus
Jan 21, 2010

GTM, active/passive and split-brain

The situation:

 

A pair of redundant sites, functionally equivalent (hardware-wise too, almost identical), geographically separated.

 

A private DS3 line in between, used to synchronize data from one site to another.

 

Site architecture is pretty typical: a GTM for site-to-site switching, firewalls, a pair of LTMs for local load balancing, webservers, other systems, storage.

 

The two sites are active/passive (they cannot both be active at the same time), mostly due to the database architecture. The active site synchronizes its fresh data to the backup site in near-real-time over the private line.

 

Each GTM monitors its local site, and a site switch decision is made if the current active site has a problem (let's say, the storage went offline for some reason).

 

 

The problem:

 

As is typical for an active/passive design, a split-brain situation can be pretty bad. We don't want both sites to become active at the same time. When that risk exists, it's better to leave the decision to a human operator (the site's reliability is only truly critical a few hours each week, during which time it is very closely monitored).

 

 

So, basically, we want this behavior:

 

- flip the active and backup roles if both GTMs can see each other and the current active site detects a local fault

 

- preserve the current state and wait for the human operator to make a decision if the connection between the two GTM units is lost
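The two rules above can be sketched as a small decision function. This is only an illustration of the desired logic, not GTM configuration; the names `Action` and `decide` and the two boolean inputs are assumptions made up for the example.

```python
from enum import Enum


class Action(Enum):
    FLIP = "flip roles"                  # promote the backup site, demote the active one
    HOLD = "hold and wait for operator"  # freeze the current state
    NONE = "no change"                   # everything is healthy


def decide(peer_reachable: bool, active_site_healthy: bool) -> Action:
    """Hypothetical decision rule for an active/passive site pair.

    peer_reachable: True if the two GTM units can currently see each other.
    active_site_healthy: True if the active site reports no local fault.
    """
    if not peer_reachable:
        # Link between the GTMs is lost: never switch automatically,
        # because each side could otherwise believe it should be active.
        return Action.HOLD
    if not active_site_healthy:
        # Both units agree on the state and the active site has a fault.
        return Action.FLIP
    return Action.NONE
```

Note that when the link is down the function returns HOLD regardless of local health, which is the point of the second rule: a local fault alone is not enough to trigger a switch.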

 

 

I'm pretty new to the GTM so I'm wondering, is this doable at all?
  • I will give you my thoughts on this for each behavior you are asking about:

     

    1. Preventing split brain is a bit difficult to enforce and you have to be willing to get creative. Ultimately, you could add an additional TCP monitor to the pools (either GTM or LTM) that targets the IP address and the 443 management port of the GTMs directly (set through the monitor's Alias Address and Alias Service Port, respectively). Also look at the minimum monitors up setting, as well as the monitor's reverse option.

     

    2. Preserving the current state is pretty easy with the "manual resume" option in the monitor. Once the monitor trips, the pool member stays down until you manually re-enable it. I see manual resume used quite a bit on database virtual servers.

     

    As a bonus: if you have some sort of witness server, you might want to look at creating a monitor that queries the witness server for service availability, and/or using the appropriate database monitor to query a value in a table that reflects the operational state of the database.
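To make the two ideas above concrete, here is a minimal sketch in plain Python of what a TCP reachability check and a "manual resume" latch do conceptually. It is not F5 configuration or an F5 API; `tcp_check` and `LatchingMonitor` are hypothetical names, and the exact re-enable semantics on a real BIG-IP may differ from this simplified model.

```python
import socket
from typing import Callable


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


class LatchingMonitor:
    """Simplified model of a monitor with a manual-resume latch (assumption)."""

    def __init__(self, check: Callable[[], bool]) -> None:
        self._check = check
        self.up = True

    def poll(self) -> bool:
        # Once marked down, the check is not consulted again:
        # the member stays down until an operator calls resume().
        if self.up and not self._check():
            self.up = False
        return self.up

    def resume(self) -> None:
        """Operator action: re-enable the member; the next poll re-evaluates it."""
        self.up = True
```

As a usage example, `LatchingMonitor(lambda: tcp_check("192.0.2.10", 443))` would model a member that goes down when the peer's management port stops answering and stays down until someone calls `resume()`; the address is a documentation placeholder, not a value from this thread.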