Forum Discussion

ShaunSimmons
Mar 05, 2025

Standby Has Fewer Online VIPs Than Active – Requires Manual Monitor Reset

 

Hello F5 community,

I’ll preface this by saying that networking has been verified as fully routable between the Active and Standby units. Both devices can ping and SSH to each other’s Self-IPs, and rebooting the Standby did not resolve the issue.

Issue: Discrepancy in Online VIPs Between Active & Standby

Despite being In-Sync, the Active and Standby units show a different number of Online VIPs.

  • If I randomly select one or two VIPs that should be online, remove their monitors, and then re-add them, BOOM, the VIP comes online (roughly the tmsh toggle sketched just below this list).
  • The VIPs in question were both HTTPS (443).
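
For reference, the toggle is just removing and re-applying the monitor at the pool level, roughly like this in tmsh (the pool and monitor names below are placeholders, not my actual objects):

    tmsh modify ltm pool example_https_pool monitor none
    tmsh modify ltm pool example_https_pool monitor https
    tmsh show ltm pool example_https_pool members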

Side Note: Frequent TCP Monitor Failures

In my environment, I also frequently see generic ‘TCP’ monitors failing, leading to outages. While I understand that TCP monitoring alone isn’t ideal, my hands are tied as all changes must go through upper management for approval.

Has anyone encountered a similar issue where VIPs don’t come online until the monitor is manually reset? Any insights into potential root causes or troubleshooting steps would be greatly appreciated!

Thanks in advance.

 



  • Which of the two devices, active or standby, generally shows more online VIPs? Do you notice any patterns in terms of a particular pool or pool member(s) failing their health checks more often than others? Are you using the same nodes across a large number of pools? Do you have health monitors applied at the node level as well, or only at the pool / pool member level?
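
    One quick way to check where monitors are attached is to list just that property; something along these lines in tmsh:

    tmsh list ltm node monitor
    tmsh list ltm pool monitor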

     

    I would recommend checking out the following article which has good tips on how to troubleshoot health monitors:
    Troubleshooting health monitors

    From past experience, I have had cases where health check failures were attributed to the back-end pool member and other cases where the BIG-IP itself was the cause. By default, the BIG-IP uses the "bigd" daemon to send health check probes from the control plane. If you have a significant number of pools with health monitors applied, bigd could be getting overwhelmed. As the previously mentioned article describes, you can check the memory usage of the "bigd" daemon by running the following command:

     

    ps aux | grep bigd


    If you notice that the memory usage for this daemon is high, you may want to consider switching to "in-TMM" monitoring, which makes the BIG-IP send health check probes using TMM (the data plane) instead of bigd (the control plane).

    More information about in-TMM monitoring is available here.
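
    If I recall correctly, in-TMM monitoring is toggled with a sys db variable along these lines (double-check the exact key and the caveats in the linked article before changing it):

    tmsh modify sys db bigd.tmm value enable
    tmsh save sys config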

    • ShaunSimmons

      Thank you for the responses -
      Troubleshooting monitors is not the issue here; the problem I am experiencing is out of the ordinary.

       

      -=TMOS 17.1.1.3=-

      Totals: 126 VIPs, 138 pools, 220 nodes. No noticeable patterns.

           The VIPs and servers are all on their respective IP subnets (an uncomplicated network topology).
                Ex: VIP(s) 192.168.10.x / Server(s): 192.168.11.x

      "bigd": I am used to environments with over 2,000 VIPs and more than 10,000 nodes. In my current role, the LTMs are basically desk paperweights. bigd is in a good state.

       

      The nodes do not have a monitor configured; only at a pool level.
      - To test whether bigd was the problem, I switched monitoring to in-TMM. No dice.

      Below: both the active and standby units can reach the pool members with curl without issues (the port is open).

      [Screenshots: Standby - Monitor status / Active - Monitor status]
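
      For reference, the check from each unit was along these lines (the address is an example from the server subnet above, not one of my actual nodes):

      curl -vk https://192.168.11.10:443/ --connect-timeout 5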

       

  • Have you recently upgraded? Check for duplicate IPs as well. Our network team ran into this: the Standby showed the VIPs up while the Active showed them down. There wasn't really a fix; it started after an upgrade, and once the boxes were rebooted it stabilized on its own.
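
    If you want to rule out a duplicate IP quickly, one option is duplicate address detection from a Linux host on the VIP subnet, assuming iputils arping is available (the interface and address below are placeholders):

    arping -D -c 3 -I eth0 192.168.10.50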

  • Sounds like a possible issue with stale monitor states or a sync problem between the units. Have you tried forcing a full config sync and clearing the monitor stats? Also, check if there's any persistence in the monitoring cache causing this. If removing/re-adding the monitor fixes it, maybe something’s getting stuck in the process. For the TCP monitor failures, could be network latency, firewall interference, or just strict timeout settings. Might be worth tweaking those if possible.
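
    A rough tmsh sketch of those suggestions (the device group, pool, and monitor names are placeholders; create a custom TCP monitor rather than editing the built-in one):

    tmsh run cm config-sync force-full-load-push to-group example_device_group
    tmsh reset-stats ltm pool example_pool
    tmsh create ltm monitor tcp example_tcp_monitor interval 10 timeout 31
    tmsh modify ltm pool example_pool monitor none
    tmsh modify ltm pool example_pool monitor example_tcp_monitor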