Forum Discussion

Stefan_Klotz_85
Dec 11, 2014

Standby LB becomes Active during HA software update

Normally a software update within the same major release, and especially just a hotfix update, is an easy and low-risk task without any downtime. We use the following steps (a rough command-line sketch follows the list):

 

  • relicense the Standby device
  • install the new software and/or hotfix into a different partition
  • activate that partition
  • verify that everything is loaded correctly
  • perform a manual failover
  • perform a UAT
  • if everything is working correctly, perform all six steps above on the second device
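
Roughly, in v11.x tmsh syntax (the ISO name and the volume HD1.2 are placeholders for our environment; on v10.x the equivalents are image2disk/bigpipe):

    # on the Standby unit:
    tmsh install sys software hotfix Hotfix-BIGIP-11.4.1-HF7.iso volume HD1.2
    tmsh show sys software status    # wait until the install reports complete
    tmsh reboot volume HD1.2         # boot into the newly installed partition
    # after the reboot, verify the unit, then fail over:
    tmsh show sys failover           # should report standby
    tmsh run sys failover standby    # run on the current Active unit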

What we noticed is that sometimes the Standby device becomes Active after rebooting into the newly installed partition. This can cause unexpected impact/downtime if something is not loaded correctly with the new software/hotfix. We are not using "Redundancy State Preference". I would expect a rebooted Standby device to always come up as Standby.

 

I know that if both devices of a cluster become Active (e.g. due to a heartbeat interruption) and then see each other again, they might end up in a different Active-Standby allocation than before the interruption. If I'm not mistaken, this is calculated based on the mgmt IP or mgmt MAC address, right? But what is the reason for switching the Active-Standby role during activation of a newly installed partition, and how can this be prevented?
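
For comparison, the candidate tie-break values can be read on each unit (v11.x tmsh; the Linux name of the mgmt interface varies by version):

    tmsh list sys management-ip   # configured mgmt IP of this unit
    ip link show                  # look up the MAC of the mgmt interface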

 

Thank you!

 

Ciao Stefan :)

 

  • This just happened to me for the first time in 7 years of upgrading HA pairs. Not only did the newly booted code go active when it came up, but when I rebooted the other unit (which was standby), it went back to standby while the other unit was rebooting. I have a ticket open with support to see what happened. I'll let you know if I find out anything useful.

     

  • I performed a hotfix update just now (10.2.4 HF3 -> HF10) and everything was fine; both units came up as Standby again. But one of my colleagues performed a similar update yesterday (10.2.3 HF1 -> 10.2.4 HF10), and there the second unit went directly into Active mode after rebooting into the new partition. Any ideas what the reason for this could be, or what can be checked afterwards to analyze it?

     

    Ciao Stefan :)

     

  • shaggy

    What type of failover are you using: failover cable (serial) or network failover (v11+)? If using network failover, what addresses do you have in your network-failover config on each device? Finally, are you using HA groups, VLAN failsafe, or gateway failsafe?

     

  • Hi shaggy,

     

    as I already mentioned, this was with v10 and we are using network failover only. In this setup we are not using HA groups, and we do not use VLAN or gateway failsafe at all. The failover addresses are:

     

    • Unit1: 192.168.0.1/29
    • Unit2: 192.168.0.2/29

    Unit2 was Standby, and after rebooting into the new partition it was still Standby. Then we performed a manual failover and updated Unit1, but after the reboot it automatically became Active instead of staying in Standby.
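
    For reference, on v11.x the current state and the configured network-failover addresses can be checked with tmsh (this cluster is v10, where the same settings live in the GUI/bigpipe failover configuration):

        tmsh show sys failover                # current state of this unit
        tmsh list cm device unicast-address   # configured failover addresses (v11.x)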

     

    Ciao Stefan :)

     

  • Did you follow this recommendation?

    In BIG-IP 10.x, F5 recommends that you configure network failover to communicate over the management network, in addition to a TMM network. To do this you must define a Network Identifier for the management network and a Network Identifier for a TMM network. For more information, refer to the Configuring High Availability section of the TMOS Management Guide for BIG-IP Systems.
    

    sol11736: Defining network resources for BIG-IP high availability features (9.x - 10.x)

    https://support.f5.com/kb/en-us/solutions/public/11000/700/sol11736.html
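
    On v11.x the same recommendation maps to adding both a TMM self IP and the mgmt IP as unicast failover entries; the device name and both addresses below are placeholders:

        tmsh modify cm device bigip1.example.com unicast-address { { effective-ip 192.168.0.1 ip 192.168.0.1 } { effective-ip 10.0.0.1 ip 10.0.0.1 } }
        tmsh list cm device bigip1.example.com unicast-address   # verify both entries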
  • The word from support is that it is recommended to Force Offline the standby unit while you upgrade it. When you are ready to upgrade the primary, Force Offline the primary and Release Offline on the standby. The reason stated is that HA Active/Standby will not work properly on two different versions.
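
    A minimal sketch of that workflow, assuming v11.x tmsh (on v10.x the bigpipe equivalents are "b failover offline" / "b failover online"):

        # on the Standby unit, before upgrading it:
        tmsh run sys failover offline    # Force Offline
        # ... install, reboot into the new partition, verify ...
        # when ready to upgrade the primary:
        tmsh run sys failover offline    # on the primary
        tmsh run sys failover online     # on the upgraded unit (Release Offline)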

     

    I have to say this has happened to me only once through probably hundreds of upgrades using network failover, all the way back to version 9.4.

     

    • shaggy
      I've had the same experience. Very rarely (knock on wood) through >100 upgrades have I experienced unexpected failovers. I have, however, had unexpected failovers caused by something in my config, such as unexpected behavior from HA groups/VLAN failsafe/bad failover addresses. Force Offline is nice, but it prevents me from seeing that everything came up successfully post-upgrade.
  • Last night I had another two software updates (just the latest hotfix) with TMOS 11.4.1. One cluster was totally fine (both LBs came up properly as Standby after the reboot into the new partition), but on the other cluster the first reboot ended up in Active mode. The second device of this cluster came up as Standby after its reboot. There is no HA group nor any other failsafe feature in use. The device which came up as Active has the lower IP on the mgmt and heartbeat VLANs (if that matters). Network failover uses just the heartbeat VLAN, no traffic via the mgmt IP.
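
    To analyze it afterwards, the state and the sod (switchover daemon) messages around the reboot can be pulled from each unit:

        tmsh show sys failover      # state right after the reboot
        grep -i sod /var/log/ltm    # switchover daemon messages around the transition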

     

    The question is: does this depend on any specific configuration, can it be avoided with one, or is this just random behavior?

     

    Ciao Stefan :)

     

  • JG

    With v11.x, force-offline the standby device before applying a hotfix release to it. The offline state will then persist across the reboot. I have tried this.
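
    A quick way to confirm this (the exact wording of the state output may differ by version):

        tmsh run sys failover offline   # before installing the hotfix
        # ... install and reboot into the new volume ...
        tmsh show sys failover          # should still show the forced-offline state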

     

    With v10.x, I remember F5 recommending to start the upgrade with the device that has the numerically highest mgmt IP address before applying a hotfix release. I'd unplug the data-plane network if I were not sure, especially in a major upgrade where I might have to rebuild the device trust.

     

    With v11.x, the device with the highest mgmt address will be automatically selected as the active device if the standby device does not have the persistent force-offline state carried across the reboot. This behavior is totally different from v10.x.

     

  • Hi Jie,

     

    Thank you for the information about force-offline in v11, that sounds logical and we will try it next time. For v10 we will investigate this further for any upcoming updates.

     

    Thank you!

     

    Ciao Stefan :)