
Zhinjio_101470
Sep 22, 2011

Active/Passive NODE question

Hey folks,

 

 

Having a friendly argument with a coworker about the best way to implement a particular failover scenario.

 

 

General Behavior Definition:

 

 

We want one node to be receiving all traffic until it fails.

 

Failover to the secondary node should be automated and quick.

 

Once a node is "promoted" to active, it should remain that way, even if the old node comes back into service.

 

A failed node returning to service (monitoring succeeds) should be a MANUAL step.

 

We want to minimize the work required to return a failed node into service again (but it should not be completely automated).

 

 

We currently have two different methods for implementing this, but it seems to me there should be a better way than either of them:

 

-------------------

 

Method 1:

 

Round Robin, Priority Pool (node priorities set to, say, 3 and 5; priority group activation "Less than 1")

 

Monitor set "Manual Resume" to "YES"

 

No persistence profile

 

 

When the active node fails, the secondary node is promoted by the priority settings. The recovered node will NOT return to service until it is manually set back to "Enabled". Before doing that, I would also adjust its Priority to be LOWER than the active node's, so it will not receive traffic until the newly promoted node fails. Then enable the node. All is well.
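To make the Method 1 sequence concrete, here is a toy model in Python. It is only a sketch of the behavior described above, not how the F5 implements it: the `Member` class and the helper names are made up for illustration, and it assumes one member per priority group with a higher number being preferred.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    name: str
    priority: int          # higher value is preferred (assumption of this model)
    healthy: bool = True   # result of the latest monitor check
    enabled: bool = True   # stays False after a failure until an operator resets it


def mark_failed(member: Member) -> None:
    """Monitor failure with Manual Resume: the member goes down and stays out."""
    member.healthy = False
    member.enabled = False


def mark_recovered(member: Member) -> None:
    """Monitor succeeds again, but Manual Resume keeps the member out of rotation."""
    member.healthy = True


def manual_enable(member: Member, new_priority: int) -> None:
    """Operator step: adjust the priority first, then re-enable the member."""
    member.priority = new_priority
    member.enabled = True


def pick(members: List[Member]) -> Optional[str]:
    """Return the member that receives traffic: highest priority among available ones."""
    available = [m for m in members if m.healthy and m.enabled]
    if not available:
        return None
    return max(available, key=lambda m: m.priority).name
```

With `node_a` at priority 5 and `node_b` at 3, failing `node_a` moves traffic to `node_b`, and calling `manual_enable(node_a, new_priority=1)` keeps it there. Calling `manual_enable(node_a, new_priority=5)` instead would flip traffic straight back, which is the step Method 1 has to guard against by hand.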

 

------------------

 

Method 2:

 

Round Robin, Priority Pool (node priorities set to 3 and 5; priority group activation "Less than 1")

 

Monitor set "Manual Resume" to "YES"

 

Persistence profile set to "Destination Address Affinity" with Timeout set to "Indefinite"

 

 

Similar to the above: the active node fails and the secondary node is promoted. The recovered node will not return to service until it is manually set back to "Enabled". UNLIKE Method 1, though, the re-enabled node will not receive traffic, because the persistence record (with its indefinite timeout) keeps traffic pinned to the newly promoted node.
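The same kind of toy model for Method 2, again only a sketch under assumptions: `StickyPool` is a made-up class, and it assumes all traffic targets one virtual address, so destination address affinity boils down to a single persistence record. The `restart` method models the persistence table being lost.

```python
from typing import Dict, Optional


class StickyPool:
    """Toy model: priority-based selection overridden by one persistence record."""

    def __init__(self, priorities: Dict[str, int]) -> None:
        self.priorities = dict(priorities)    # higher value is preferred
        self.available = set(priorities)      # members allowed to take traffic
        self.persisted: Optional[str] = None  # the single persistence record

    def fail(self, name: str) -> None:
        """Monitor failure: the member leaves rotation and its record is dropped."""
        self.available.discard(name)
        if self.persisted == name:
            self.persisted = None

    def manual_enable(self, name: str) -> None:
        """Operator step: re-enable the member without touching priorities."""
        self.available.add(name)

    def restart(self) -> None:
        """Model a restart as the loss of the persistence record."""
        self.persisted = None

    def pick(self) -> Optional[str]:
        """Follow the persistence record if it is usable, else fall back to priority."""
        if self.persisted in self.available:
            return self.persisted
        if not self.available:
            return None
        self.persisted = max(self.available, key=lambda n: self.priorities[n])
        return self.persisted
```

In this model, after `fail("node_a")` and `manual_enable("node_a")` the pool keeps returning `node_b` with no priority changes. After `restart()` the record is gone and the higher-priority `node_a` is picked again.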

 

--------------------

 

 

So it seems to me that with the second method, you are setting the priority pool settings, but not really "taking advantage" of them. The only purpose they really serve is to ensure that only one node is getting traffic, but then they are ignored because of the persistence profile. The one advantage of the second method is that I don't have to muck with priority values when I bring a node back into service.

 

 

Either way, both of these methods seem like "hacks" that just cover for the fact that there is no "real" active/passive node implementation that meets our requirements.

 

 

Y'all are smart. Are we missing some feature or method for doing this that would still cover our requirements?

 

 

Thanks in advance,

 

- ZJ

5 Replies

  • Hi ZJ,

     

    Here is a method that I have used in the past:

    http://devcentral.f5.com/wiki/iRules.SingleNodePersistence.ashx

     

     

     

    You can alter it so that you can toggle persistence from one node to the other via an HTTP request.

     

     

    That is the beauty of the F5: you can customize active/passive methods without relying on a vendor's version of what a "real" active/passive configuration is.

     

     

     

     

    I hope this helps

     

     

    Bhattman

     

  • That is a great solution. However... you noted:

     

     

    Doesn't offer the capability of manual resume after failure, or true designation of a "primary" and "secondary" instance (sometimes required for db applications)...

     

    That is, in fact, the situation we're talking about, even though I hadn't mentioned it above. When a node fails, a DBA will check the failed node, verify its health, ensure there are no colliding transactions, turn on replication again, and verify everything is healthy. We then want to "enable" the node again, but without it taking traffic. Ideally, the fewer times we "flip flop" the better.

     

     

    The real question here is of solution "elegance", I guess.
  • Hi Zhinjio,

     

    In that case, "elegance" would mean that you want built-in functionality rather than a customized one.

     

     

    You can still use the iRules version together with another script (using iControl) that the DBA can run to fail the persistence over to the other node when they are ready.

     

     

    Bhattman
  • You could try using Manual Resume on the monitor. From the online help:

     

     

     

    Specifies whether the system automatically changes the status of a resource to Enabled at the next successful monitor check. If you set this option to Yes, you must manually re-enable the resource before the system can use it for load balancing connections. The default is No.

     

     

    * Yes: Specifies that you must manually re-enable the resource after an unsuccessful monitor check.

     

    * No: Specifies that the system automatically changes the status of a resource to Enabled at the next successful monitor check.

     

     

     

    Another option would be to use a db monitor to make a SQL query to a table that only returns a successful response if that SQL node is active.

     

     

    Aaron
  • Posted By hoolio on 09/27/2011 10:47 AM

     


     

    Exactly. We are doing both of those things. Specifically, the monitor is "External". It ultimately runs a SQL query against the database, the result of which determines whether the monitor prints "UP" (it's alive) or prints nothing (it's not).
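For anyone reading along, the shape of such an external monitor can be sketched as below. This is not our actual script, just a minimal outline under assumptions: it assumes the node address and port arrive as the first two arguments, that the address may carry an IPv4-mapped `::ffff:` prefix, and that any output on stdout means "up". `placeholder_check` is a hypothetical stand-in for the real SQL query.

```python
import sys
from typing import Callable, List


def normalize_address(raw: str) -> str:
    """Strip the IPv4-mapped IPv6 prefix that the node address may carry."""
    prefix = "::ffff:"
    return raw[len(prefix):] if raw.startswith(prefix) else raw


def monitor_output(argv: List[str], is_active: Callable[[str, int], bool]) -> str:
    """Return "UP" when the check passes, or an empty string when it does not."""
    if len(argv) < 3:
        return ""
    host = normalize_address(argv[1])
    try:
        port = int(argv[2])
        return "UP" if is_active(host, port) else ""
    except Exception:
        # Any error counts as a failed check, so stay silent.
        return ""


def placeholder_check(host: str, port: int) -> bool:
    """Hypothetical stand-in for the real SQL query against host:port."""
    return False


if __name__ == "__main__":
    # Write to stdout only on success; silence marks the node as down.
    sys.stdout.write(monitor_output(sys.argv, placeholder_check))
```

Keeping the check as a plain function argument means the "is this SQL node the active one" query can be swapped in without touching the up/silent logic.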

     

     

    Recovery is manual on the failed node. The question is in how you bring the failed node back into service.

     

     

    Method 1 requires rejiggering node priorities when you re-enable the node, to make sure it does not start taking traffic.

     

     

    Method 2 requires just re-enabling the node, but it also ignores node priorities, and it will swap nodes again if you restart the F5.

     

     

    I guess it's sort of a moot point, since there are manual steps in either case. It just seems that both options are "inelegant", and I was looking for some other solution that we might not have considered.

     

     

    I do appreciate the conversation, though. It at least validates that we didn't miss something obvious.

     

     

    - ZJ