Inconsistent Monitors
We have two F5 LTMs in a Sync-Failover pair, load balancing some critical production services. They are set to sync automatically and both report that they are in sync. I inherited them a couple of weeks ago, with minimal handover, from someone who resigned.
We have a number of services, and the monitors for these services are set at pool level. There is a variety of monitors: some are built-in and some are custom.
The primary shows all services and nodes as up. The standby shows some nodes and some complete services as down. Hovering over the red diamond on a 'down' node on the standby gives a message of the form "Offline(Enabled) {Monitor} failed to connect. Failed to succeed before deadline @ {date}". The dates vary between 1 and 5 weeks ago.
I've tried the following to diagnose:
- Telnetting from the management interface of the standby to all 'down' services on the relevant port for each service. No problems encountered.
- Pinging the 'down' nodes from the management interface of the standby, and its internal and external interfaces. All are successful.
From this I think it's safe to assume that there are no firewall or routing issues.
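The telnet checks above can be repeated in bulk with a short script. This is only a minimal Python sketch of a plain TCP connect test, not anything F5-specific: `check_tcp`, `check_all`, and the optional `source_ip` argument are hypothetical names, and the target list would have to be filled in with the real node addresses and ports. Binding to a specific source address is included because a connection made from one local address does not prove that another local address can reach the same node.

```python
import socket


def check_tcp(host, port, timeout=3.0, source_ip=None):
    """Return True if a TCP connection to host:port succeeds within timeout.

    source_ip (optional, hypothetical parameter) binds the outgoing
    connection to a specific local address; port 0 lets the OS pick one.
    """
    source = (source_ip, 0) if source_ip else None
    try:
        with socket.create_connection(
            (host, port), timeout=timeout, source_address=source
        ):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_all(targets, timeout=3.0, source_ip=None):
    """Check a list of (host, port) pairs and return {(host, port): bool}."""
    return {
        (host, port): check_tcp(host, port, timeout, source_ip)
        for host, port in targets
    }
```

Calling `check_all([(host, port), ...])` with the 'down' pool members returns a dictionary of results, so any member that really is unreachable stands out immediately.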
I've run packet tracing on both the primary and standby LTMs. This shows:
- No traffic between the management interface and any node. This is the same on both LTMs.
- Conversations between the internal IP address of the primary and all nodes on a regular basis.
- Conversations between the internal IP address of the secondary and some of the nodes (the 'up' ones).
- No traffic between the internal IP address of the secondary and any of the 'down' nodes.
This leads me to think that the monitoring is done via the internal interface and that the secondary has decided not to recheck 'down' nodes.
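The comparison behind that conclusion can be written down as a small check over the captured conversations. This is a rough Python sketch under stated assumptions: the capture is assumed to have already been reduced to (source IP, destination IP) pairs, and `nodes_without_traffic` and its arguments are hypothetical names, not part of any F5 tool.

```python
def nodes_without_traffic(conversations, source_ip, nodes):
    """Return the nodes that never exchanged traffic with source_ip.

    conversations: iterable of (src_ip, dst_ip) pairs taken from a capture.
    source_ip:     the address under test, e.g. a unit's internal IP.
    nodes:         the node addresses expected to be monitored.
    """
    contacted = set()
    for src, dst in conversations:
        # Count traffic in either direction as contact with the node.
        if src == source_ip:
            contacted.add(dst)
        elif dst == source_ip:
            contacted.add(src)
    return sorted(node for node in nodes if node not in contacted)
```

Run once per unit with that unit's internal IP address, the result for the primary should be empty, while the result for the secondary should match the nodes it shows as 'down'.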
According to the BIG-IP Local Traffic Manager: Monitors Reference, I should be able to enable and disable monitor instances. I thought that doing this might cause the secondary to attempt to contact the 'down' nodes. Although I can list the instances of any monitor, I'm not given any way to select an instance and thereby enable or disable it. This is the case both for monitors created before I was given this to support and for a test one I've just created.
So, I have several questions:
- Is this normal behaviour?
- If not (or even if it is), what happens when there's a failover?
- Is there any way of manually forcing the secondary to rescan?