Forum Discussion

Martin_Sharratt
Nov 21, 2014

Inconsistent Monitors

We have two F5 LTMs in a Sync-Failover pair, load balancing some critical production services. They are set to sync automatically and both report that they are in sync. I inherited them a couple of weeks ago, with minimal handover, from someone who resigned.

 

We have a number of services, and the monitors for these services are set at pool level. There is a variety of monitors: some are built-in and some are custom.

 

The primary shows all services and nodes as up. The standby shows some nodes and some complete services as down. Hovering over the red diamond on a 'down' node on the standby gives a message of the form "Offline(Enabled) {Monitor} failed to connect. Failed to succeed before deadline @ {date}". The dates vary between 1 and 5 weeks ago.

 

I've tried the following to diagnose:

 

  1. Telnetting from the management interface of the standby to all 'down' services on the relevant port for each service. No problems encountered.
  2. Pinging the 'down' nodes from the management interface of the standby, and its internal and external interfaces. All are successful.

From this I think it's safe to assume that there are no firewall or routing issues.

 

I've run packet tracing on both the primary and standby LTMs. This shows:

 

  1. No traffic between the management interface and any node. This is the same on both LTMs.
  2. Conversations between the internal IP address of the primary and all nodes on a regular basis.
  3. Conversations between the internal IP address of the secondary and some of the nodes (the 'up' ones).
  4. No traffic between the internal IP address of the secondary and any of the 'down' nodes.

This leads me to think that the monitoring is done via the internal interface and that the secondary has decided not to recheck 'down' nodes.
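
The per-node check described above can be narrowed to one node at a time. A minimal sketch, assuming the monitor traffic leaves through a VLAN named `internal`; the node address and port in the example are hypothetical placeholders. The helper only prints the tcpdump command so it can be reviewed before being run on each unit:

```shell
# Print a tcpdump command that captures traffic between this unit and one node.
# The VLAN name, node address and port are supplied by the caller; the values
# used in the example below are placeholders, not taken from this thread.
capture_cmd() {
    vlan="$1"
    node_ip="$2"
    port="$3"
    # -n: no name resolution, -i: interface or VLAN, -c 20: stop after 20 packets
    echo "tcpdump -ni ${vlan} -c 20 host ${node_ip} and port ${port}"
}

# Example: capture_cmd internal 10.0.0.10 80
```

Running the printed command on both units for the same 'down' node should show whether the secondary sends any monitor probes to it at all.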

 

According to the BIG-IP Local Traffic Manager: Monitors Reference, I should be able to enable and disable monitor instances. I thought that doing this might cause the secondary to attempt to contact the 'down' nodes. Although I can list the instances of any monitor, I'm not given any way to select an instance and thereby enable or disable it. This applies both to monitors created before I was given this to support and to a test one I've just created.

 

So, I have several questions:

 

  1. Is this normal behaviour?
  2. If not, (or even if it is) what happens when there's a failover?
  3. Is there any way of manually forcing the secondary to rescan?
  • shaggy
    1. Is this normal behaviour? No.
    2. If not, (or even if it is) what happens when there's a failover? The nodes may still fail.
    3. Is there any way of manually forcing the secondary to rescan? bigd is the process that handles monitoring, so restarting that service may kick something into gear on the standby unit:
      tmsh restart sys service bigd

    Also, since this is strange behavior, you should open a case with F5 support. They may be able to help you better identify potential issues.

  • I'm having the same inconsistent monitor problem with 2 HA pairs in my environment. My primary BIG-IP 4K units show all nodes as green, but the standby units are marking some nodes down. I reloaded both standby units, and the issue is still present. Please advise if there are any other troubleshooting steps that I can try.

     

    Regards,

     

    WH

     

  • I raised a support case with F5, who said that this is due to a bug in the version my F5s were on (11.5.1). The cure is to upgrade to 11.6.0, which I've now done and which has solved the problem.

     

    In the interim (it took a few weeks before I could upgrade, for various unrelated reasons), running this command at a bash prompt on the primary forces them back in sync:

     

    /usr/bin/tmsh modify cm device-group cluster name devices modify {primary {set-sync-leader } }

     

    As we had a service that is stopped by the administrators daily, I created a cron job to run this command once a day in the early hours of the morning.
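
    For illustration, such a cron entry might look like the fragment below. This is a sketch only: the 03:00 schedule is an assumption (the post says only "early hours of the morning"), and `cluster name` and `primary` are kept exactly as in the command above, where they stand in for the actual device group and device names.

    ```
    # min hour day-of-month month day-of-week  command
    # Run the sync-leader command once a day at 03:00 (assumed time).
    0 3 * * * /usr/bin/tmsh modify cm device-group cluster name devices modify {primary {set-sync-leader } }
    ```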

     

    Martin S

     

  • Thanks for the information Martin S. I'm currently running version 11.5, so I will be getting a maintenance window to upgrade to 11.6 as soon as possible.

     

    Regards, -WH

     

    • Martin_Sharratt
      Hi david. F5 support pointed me at the following KB article - I think it has the Bug ID. https://support.f5.com/kb/en-us/solutions/public/14000/600/sol14639.html