GTM Pool Members Gone After Maintenance? It's Probably This One Setting
You finish a maintenance window, everything looks good on LTM, and then someone notices Wide IPs are resolving to fewer destinations than before. You check the GTM pools and the members are just... gone. The virtual servers are fine on LTM. GTM just doesn't know about them anymore — and more importantly, it doesn't remember if they were ever pool members.

This happens more often than it should, and it almost always comes back to the same thing: virtual-server-discovery enabled doing exactly what it was designed to do, at exactly the wrong moment.

What's Actually Going On

When virtual-server-discovery is set to enabled on a GTM server object, GTM keeps its view of LTM virtual servers in sync via iQuery. It automatically adds new virtual servers, updates existing ones, and — this is the part that causes problems — deletes virtual servers that LTM stops reporting.

That delete behavior is the issue. Any time iQuery reports zero virtual servers, even temporarily, GTM treats it as a mass deletion event. The virtual servers get pulled from the server object, and with them, their pool memberships. When LTM eventually reports those virtual servers again, GTM re-discovers them as brand new objects with no memory of which pools they belonged to.

Two scenarios trigger this consistently.

Scenario 1: LTM Software Upgrade

This is the one that catches most people. During an upgrade, LTM reboots and goes through a phase where iQuery can connect but the full configuration hasn't finished loading yet. From GTM's perspective, LTM is reachable but reporting no virtual servers. GTM interprets that as a deletion event, clears out the discovered virtual servers, and empties the pools.

When LTM finishes loading and the virtual servers come back, GTM re-discovers them — but the pool memberships are gone. You're left manually rebuilding what was there before the maintenance window started.

The telltale sign is pool members coming back in blue/CHECKING state. That only happens to newly discovered objects. GTM treated a returning virtual server as a brand new one — because as far as it's concerned, it is. The GTM log won't show a deletion event, only the re-add. That gap in the logs is a known blind spot with virtual-server-discovery enabled, and it's exactly why the problem is hard to diagnose after the fact.

What you'll typically see in /var/log/gtm after the LTM comes back:

    alert gtmd[xxxxx]: 011a1005:1: SNMP_TRAP: Pool your_pool state change green --> red (No enabled pool members available)
    alert gtmd[xxxxx]: 011a3004:1: SNMP_TRAP: Wide IP your.wideip.example.com state change green --> red (No enabled pools available)

And then shortly after, the virtual servers re-appear in CHECKING state as GTM re-discovers them — but with no pool bindings.

Scenario 2: LTM HA Failover

This one surprises people because the LTM pair is still running — it's just switching active units. After a failover, the new active device may not have its iQuery connections fully re-established yet. GTM sees the iQuery state as inconsistent, virtual server status updates stop coming through, and members disappear from the discovered list.

What makes this harder to diagnose is that tmsh show gtm iquery may show "connected" — but connected doesn't mean the config sync is working correctly. In a GTM sync group, only the device assigned local ID 0 (the GTM with the lowest IP address) is responsible for writing auto-discovery results to the configuration.
If that specific device loses its iQuery connection during the failover window, discovery events are missed entirely — even if every other GTM in the group can still reach the LTM. So you can have a situation where five out of six GTMs look perfectly healthy, iQuery shows connected everywhere, and yet pool members are still disappearing — because the one device that matters for discovery is the one with the broken connection.

You can check which device in your sync group holds local ID 0 with:

    tmsh list sys db gtm.peerinfolocalid

If that device's iQuery connection to the LTM is the one that dropped during the failover window, that's your answer — even if everything else looks fine.

The Fix: enabled-no-delete

Both scenarios share the same root cause: GTM's auto-delete behavior treating a temporary iQuery disruption as a permanent deletion event. The fix is the same for both:

    gtm server /Common/site1-ltm {
        addresses {
            10.1.1.1 {
                device-name site1-ltm
            }
        }
        datacenter /Common/dc1
        monitor /Common/bigip
        virtual-server-discovery enabled-no-delete
    }

With enabled-no-delete, GTM still auto-discovers new virtual servers and keeps existing ones updated. The only thing that changes is that it will never delete a virtual server just because LTM temporarily stopped reporting it. Your pool memberships survive both scenarios above.

    Mode                 Adds new VS   Updates VS   Deletes VS   Pool memberships survive iQuery disruption?
    disabled             No            No           No           Yes — nothing changes
    enabled              Yes           Yes          Yes          No — any disruption can empty pools
    enabled-no-delete    Yes           Yes          No           Yes — preserved

The Trade-Off

enabled-no-delete won't clean up after you when you intentionally decommission a virtual server on LTM. The stale GTM object stays in the discovered list until you remove it manually. In environments with a lot of VS churn, this can accumulate over time.

The question is which failure mode you'd rather manage: pool members silently disappearing during a maintenance window, or occasionally needing to clean up stale objects after a planned decommission. For most production environments, the latter is far easier to deal with — and far less likely to wake someone up at 2am.

How to Make the Change

Via tmsh:

    tmsh modify gtm server /Common/site1-ltm \
        virtual-server-discovery enabled-no-delete
    tmsh save sys config

Via GUI:

    Go to DNS → GSLB → Servers
    Select the server object
    Set Virtual Server Discovery to Enabled (No Delete)
    Click Update

This takes effect immediately and does not affect existing discovered virtual servers or current pool memberships.

Cleaning Up Stale Objects

When you intentionally decommission a virtual server on LTM, remove the leftover GTM object manually:

    # List virtual servers under a GTM server object
    tmsh list gtm server /Common/site1-ltm virtual-servers

    # Remove a specific stale entry
    tmsh modify gtm server /Common/site1-ltm \
        virtual-servers delete { /Common/old-vs-name }
    tmsh save sys config

Make this part of your standard VS decommission runbook and stale objects will never pile up.
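If stale entries do accumulate anyway, a quick audit is to compare what GTM has discovered against what the LTM actually hosts. A minimal sketch using the article's example server object; the awk field extraction assumes the default one-line output format and may need adjusting on your version:

    # On the GTM: every virtual server discovery currently believes exists
    tmsh list gtm server /Common/site1-ltm virtual-servers

    # On the LTM: the virtual servers that actually exist
    tmsh list ltm virtual one-line | awk '{print $3}' | sort

    # Anything present in the first list but absent from the second is a
    # candidate for the virtual-servers delete command shown above.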
Quick Diagnostic When Members Go Missing

Before assuming it's a discovery issue, check iQuery health across all GTM devices first:

    tmsh show gtm iquery

Look for:

    State: should be "connected" for all entries
    Reconnects: a high count suggests instability even if the connection looks up
    Configuration Time: "None" means the config has never successfully synced from that LTM

Then confirm which GTM holds local ID 0 and verify its connectivity specifically:

    tmsh list sys db gtm.peerinfolocalid

If the local ID 0 device is the one with the broken iQuery connection, that's your answer — regardless of what the other devices are showing.

Wrapping Up

Whether it's an LTM upgrade or an HA failover, the pattern is the same: iQuery goes quiet for a moment, GTM interprets silence as deletion, and your pool memberships are gone. It's working as designed — just not in a way that's useful to you.

enabled-no-delete is a one-line change that stops this from happening. The cleanup overhead it introduces is predictable and manageable. The alternative — rebuilding pool memberships after an unplanned event — is not.

Have you run into either of these scenarios in your environment? Drop a comment below, especially if you've seen the local ID 0 shift cause issues during a rolling GTM upgrade.
Configuration Assistance: Configure Email Alerts for HA Failover Events and Device Offline

We have a BIG-IP VE High Availability Pair deployed in Microsoft Azure. We need to configure the BIG-IP to automatically send an email notification to our Operations teams immediately when a failover event occurs (when the unit goes from Active to Standby or Offline). Could you provide the recommended procedure for the configuration to trigger these email alerts?
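One mechanism worth evaluating is the alertd custom-alert file, which can send email when a logged message matches a string. This is a sketch only, assuming the BIG-IP already has a working outbound mail relay configured; the alert names, match strings, and addresses are placeholders, and the match strings must be tuned to the exact messages your version logs during a failover:

    # /config/user_alert.conf (sketch; takes effect after alertd re-reads
    # it, e.g. bigstart restart alertd)
    alert HA_WENT_STANDBY "Standby" {
        email toaddress="ops@example.com"
        fromaddress="bigip-a@example.com"
        body="BIG-IP HA event: this unit transitioned to Standby"
    }
    alert HA_WENT_OFFLINE "Offline" {
        email toaddress="ops@example.com"
        fromaddress="bigip-a@example.com"
        body="BIG-IP HA event: this unit transitioned to Offline"
    }

Bare strings like "Standby" can match unrelated log lines, so in practice you would anchor on the full sod message text. Also note that outbound SMTP from Azure is often restricted, which is worth verifying before relying on email alerts.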
failover issue between datacenters with GSLB

Ok, I have been looking for the best solution for an application requirement. I have an app that will be living in two datacenters. Each location has 4 servers running. What they want is for the application to run 100% of the time in DC1 unless 2 or more servers fail in DC1, and then fail over to DC2.

I thought this would be rather easy, but I have been unable to find a way to get the VIP to go into a down state if two servers have gone down. Does anyone have an idea how I could implement this? Thanks
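For reference, the DC1-first part of this maps onto the Global Availability load-balancing method, which always answers with the first available pool in order. The 2-of-4 threshold is the hard part, since GTM has no native member-count condition; one possible approach, sketched here under assumptions, is an external (EAV) monitor that counts active LTM pool members over iControl REST and only marks the DC1 resource up while enough members remain. The address, pool name, credentials, and threshold are all placeholders, and the JSON parse is deliberately naive:

    #!/bin/bash
    # Hypothetical GTM external monitor (EAV convention: printing anything
    # to stdout marks the monitored resource up).
    MIN_UP=3   # require at least 3 of 4 members; fail DC1 when 2+ are down

    UP=$(curl -sku monitor:changeme \
        https://10.1.1.1/mgmt/tm/ltm/pool/~Common~app_pool/stats \
        | grep -o '"activeMemberCnt":{"value":[0-9]*' | grep -o '[0-9]*$')

    [ "${UP:-0}" -ge "$MIN_UP" ] && echo "up"

    # The wide IP itself would use Global Availability with the DC1 pool
    # ordered first, e.g.:
    #   tmsh modify gtm wideip a app.example.com pool-lb-mode global-availability

Treat this as a starting point rather than a drop-in: the script would be imported as a gtm monitor external definition and attached to the DC1 resource.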
A device recently failed over, but I don't know why.

Hi, I recently had a device failover, but I can't find anything in the logs to indicate why.

log:

    Oct 11 09:13:11 dvice_A notice tmm[13231]: 01340011:5: HA unit 1 state change: from 1 to 0.
    Oct 11 09:13:11 dvice_A notice tmm2[13231]: 01340011:5: HA unit 1 state change: from 1 to 0.
    Oct 11 09:13:11 dvice_A notice tmm2[13231]: 01340011:5: HA unit 1 state change: from 1 to 0.
    Oct 11 09:13:11 dvice_A notice tmm5[13231]: 01340011:5: HA unit 1 state change: from 1 to 0.
    Oct 11 09:13:11 dvice_A notice tmm5[13231]: 01340011:5: HA unit 1 state change: from 1 to 0.
    Oct 11 09:13:11 dvice_A notice tmm6[13231]: 01340011:5: HA unit 1 state change: from 1 to 0.
    Oct 11 09:13:11 dvice_A notice tmm6[13231]: 01340011:5: HA unit 1 state change: from 1 to 0.
    Oct 11 09:13:11 dvice_A notice tmm3[13231]: 01340011:5: HA unit 1 state change: from 1 to 0.
    Oct 11 09:13:11 dvice_A notice tmm3[13231]: 01340011:5: HA unit 1 state change: from 1 to 0.
    Oct 11 09:13:11 dvice_A notice tmm4[13231]: 01340011:5: HA unit 1 state change: from 1 to 0.
    Oct 11 09:13:11 dvice_A notice tmm4[13231]: 01340011:5: HA unit 1 state change: from 1 to 0.
    Oct 11 09:13:11 dvice_A notice tmm7[13231]: 01340011:5: HA unit 1 state change: from 1 to 0.
    Oct 11 09:13:11 dvice_A notice tmm7[13231]: 01340011:5: HA unit 1 state change: from 1 to 0.
    Oct 11 09:13:20 dvice_A notice mcpd[6384]: 0107168c:5: Incremental sync complete: This system is updating the configuration on device group /Common/dg device %cmi-mcpd-peer-/Common/dvice_B from commit id { 81368 7424061905973374382 /Common/dvice_A } to commit id { 85137 7424304469001948692 /Common/dvice_A }.
    Oct 11 09:13:20 dvice_A notice mcpd[6384]: 0107168c:5: Incremental sync complete: This system is updating the configuration on device group /Common/dg device %cmi-mcpd-peer-/Common/dvice_B from commit id { 81368 7424061905973374382 /Common/dvice_A } to commit id { 85137 7424304469001948692 /Common/dvice_A }.
    Oct 11 09:18:09 dvice_A notice icrd_child[13462]: 13462,13469, iControl REST Child Daemon, INFO, Transaction [1728605588792715] execution time expired.
    Oct 11 17:45:40 dvice_A info platform_agent[7260]: 01e10007:6: Token is renewed.
    Oct 11 17:45:40 dvice_A info platform_agent[7260]: 01e10007:6: Token is renewed.

thank you
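The tmm lines above are the effect of the failover, not the cause; the daemon that ordered the state change usually logs elsewhere. A general sketch of where to look next (standard log locations, nothing environment-specific):

    # Search around the timestamp for whatever ordered the transition
    grep -iE 'sod|failover|failsafe|standby|active' /var/log/ltm*

    # Per-feature failover condition on each unit
    tmsh show sys ha-status all-properties

Common triggers that show up this way include VLAN or gateway failsafe, a monitor-driven HA action, or a peer-initiated takeover.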
Network failover - peer-offline

Hello, I think I'll need advice, or at least some opinions, here... On the cluster of F5 we manage, the secondary node became master one month ago. Besides, I see in the GUI that the button "Force Failover" is greyed out, so it is impossible to trigger a failover from there. But maybe I could force it in CLI... I am not sure yet. I haven't tried that, for now (it is not our cluster, so I must be careful).

Anyway, when I ran tests on the cluster, I found this:

    # show cm failover-status
    --------------------
    Status STANDBY
    (...)
    -----------------------------------------------------------------------------------------------------------
    address IP1:1026    nodename_Sec    0           1    -                      Error
    address IP2:1026    nodename_Sec    0           1    -                      Error
    address IP3:1026    nodename_Sec    30334301    3    2024-Sep-09 16:48:55   Ok

(P.S. I do not indicate the real addresses / node names here, of course...)

    # show /cm traffic-group
    (...)
    -------------------------------------------------------------------------------------------------
    traffic-group-1    nodename_Pri    standby    true     false    -
    traffic-group-1    nodename_Sec    active     false    false    peer-offline

    # show /sys failover
    Failover active for 35d 04:03:10

Well, there are 3 addresses used for the failover communication. The first two are self IPs. They are configured with port lockdown "none". Normally, that is not correct, I know; it should be set to "default" or "allow all". BUT the management IP works well, obviously; we have a status "Ok" for that one. So basically, I should be able to trigger a failover in that case, at first view. Except no, because the button "Force Failover" is greyed out.

However, I also see the "peer-offline" with my cmd "show /cm traffic-group". That means I should be in this situation: https://my.f5.com/s/article/K000137178. But a "netstat -pan" doesn't show me any "sod off". So I am not sure of that, after all.

So:
1/ Do you know if the fact that I see "peer-offline" explains, by itself, why my "Force Failover" button is greyed out?
2/ Is it functional, in your opinion, that only the management IP is usable for the failover communication? Could that explain the whole problem too?
3/ I do not see "sod off" with a "netstat -pan" (cf. the KB I shared above). Despite that, do you think I should restart sod?

In brief, if someone has seen a similar situation and has an opinion or a suggestion about it, please share. Have a nice day! Best regards, Christian
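For what it's worth, a few checks that usually narrow this kind of thing down, with the caveat that restarting sod on an active unit can itself trigger a failover, so start on the standby:

    # Failover unicast addresses each device advertises to its peer
    tmsh list cm device unicast-address

    # Port lockdown on the self IPs carrying failover traffic;
    # udp:1026 (network failover) must be allowed through
    tmsh list net self allow-service

    # Confirm sod is actually listening for network failover
    netstat -pan | grep 1026

    # Last resort, on the standby unit first
    bigstart restart sod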
Force peer to standby using Ansible!?

So, I have read over a number of articles found online about determining which node is active and attempting to perform the "force to standby" action, which in turn causes the HA peer to become active (same as clicking the button by the same name in the GUI).

I have successfully generated a variable that contains the node that needs to be sent to standby. That part works thus far. However, when I then use the "f5networks.f5_modules.bigip_node" Ansible module to perform the failover, it tells me that the task resulted in a "change", but the node does not go offline and its peer does not become active.

    - name: Force the failover of the B unit peer...
      f5networks.f5_modules.bigip_node:
        state: offline
        fqdn: "{{ host_to_failover.name }}"
        name: "{{ host_to_failover.name }}"
        provider:
          server: "{{ host_to_failover.name }}"
          server_port: 443
          validate_certs: false
          no_f5_teem: false
          user: admin
          password: "{{ admin_acct_password }}"
      delegate_to: localhost

Output:

    TASK [debug] **********************************************************************
    ok: [f5-r5900-a.its.utexas.edu] => {
        "host_to_failover": {
            "name": "f5-r5900-revprox-b.its.utexas.edu",
            "state": "active"
        }
    }

    TASK [Force the failover of the B unit peer...] ***************************************
    changed: [f5-r5900-a.its.utexas.edu]

Some confusion about parameters (see: ansible module > bigip_node):

    It requires a "name" parameter, even though I am not adding a "node", so I just populate it with the hostname I want to "force to standby".
    It also asks for an fqdn/address. I guess this is the node I want to perform the "force to standby" on? So I also populate that with the hostname I want to "force to standby".
    There is also the "server" within the "provider": does that have any effect on requesting this action? I tried putting the node to failover here, as well as its peer, but it makes no difference.

Nothing I do actually causes a "failover" to the "A peer" from the "B peer". HELP!? What is the best way (example please?) to "force to standby" a node, to cause a failover, and the HA peer to become the active peer?
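For reference, bigip_node manages LTM node objects (the server addresses behind pools), which is likely why the task reports a change without touching HA state. The GUI's "Force to Standby" button corresponds to a single tmsh command, so one approach, sketched here on the assumption that running a raw command (e.g. via the bigip_command module or SSH) is acceptable in your playbook, is to run it against the currently active unit:

    # Equivalent of the GUI "Force to Standby" button; run on or against
    # the unit you want to demote (the currently active one).
    tmsh run sys failover standby

    # To demote a single traffic group only (name is a placeholder):
    tmsh run sys failover standby traffic-group traffic-group-1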
How to force close TLS sessions in a failover scenario

Hi, we have an application behind BIG-IP which doesn't handle failovers well. The BIG-IP keeps all TLS sessions consistent and open during failover, but the application doesn't support resuming a TLS session, and this causes problems in the app. I'm looking for a way to close TLS sessions for a specific VS in a failover scenario. We're on version 16.1.4.1. Any suggestions? Thanks
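One angle worth testing, as a sketch rather than version-specific advice: established flows only survive a failover when connection mirroring is enabled on the virtual server, so disabling it forces clients to build fresh TLS sessions after a switchover, and lingering flows can be flushed by hand. The virtual server name, profile name, and VIP address below are placeholders:

    # Without mirroring, established connections are not carried over to the
    # newly active unit, so clients must re-handshake.
    tmsh modify ltm virtual /Common/app_vs mirror disabled

    # Flush any flows to the VIP that survived a failover:
    tmsh delete sys connection cs-server-addr 10.1.1.100

    # If the app also mishandles resumed sessions in general, resumption can
    # be disabled on the client-ssl profile by zeroing its session cache:
    tmsh modify ltm profile client-ssl /Common/app_clientssl cache-size 0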
VLAN Failsafe failover settings change on STANDBY device - affect ACTIVE device?

We have two devices in an HA group, but the failsafe is VLAN-based and set to fail over on both devices. If I turn off VLAN failsafe on the standby device, does that affect the HA group or the ACTIVE device?
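A quick way to check is to compare the setting on both units; VLAN objects live in each device's base configuration, which config sync does not push, so a change made on the standby stays on the standby. The VLAN name below is a placeholder:

    # Run on each unit and compare; failsafe only acts on the device where
    # the VLAN stops passing traffic.
    tmsh list net vlan external failsafe failsafe-action failsafe-timeout

    # Disable VLAN failsafe on this unit only:
    tmsh modify net vlan external failsafe disabled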