We have started some external monitoring of our F5s to alert when pool members are marked down. What we have noticed is that a node is marked down for a relatively short amount of time, generally between 4 and 25 seconds. We see this in the LTM logs. We implemented monitoring for a specific node and captured one of these events. I can see where the LTM creates the socket to the server and attempts the "ping" of a GET .... I don't see any responses. It does this 3 times, according to the 5/16 settings, and then logs:
(_send_active_service_ping): unable to connect; giving up
and marks the node down.
Looking at our Splunk/IIS logs for the node in question, I don't see events for these connections. I see the entries for the good 200 OK responses, but I don't see any log entries for these connections from the LTM. I am trying to determine how to troubleshoot further. The LTM and the server are on the same VLAN, so no routing is involved.
Any suggestions on where to look or how I can approach this with the application team/server owners?
You should run a packet capture on both the F5 unit and the server, then compare the timestamps: when the F5 unit sends the probe versus when the server receives it, and when the server sends the response versus when the F5 unit receives it. (Also check whether the response is already delayed on the server itself.)
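To make that comparison concrete, here is a rough Python sketch of how the four timestamps for one probe could be lined up once you have pulled them out of the two captures. The `ProbeTimes` type, the `diagnose` function, and the sample values are all hypothetical, and the sketch assumes both hosts have synchronized clocks (for example via NTP); otherwise the cross-host differences are meaningless.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeTimes:
    """Timestamps (seconds) for one monitor probe, taken from both captures.

    None means the packet was never seen in that capture.
    """
    f5_sent: float
    server_received: Optional[float]
    server_replied: Optional[float]
    f5_received: Optional[float]


def diagnose(p: ProbeTimes) -> str:
    """Report where a probe was lost, or how long each stage took."""
    if p.server_received is None:
        return "probe never reached the server"
    if p.server_replied is None:
        return "server received the probe but did not reply"
    if p.f5_received is None:
        return "reply left the server but never reached the F5"
    total = p.f5_received - p.f5_sent
    server_time = p.server_replied - p.server_received
    return f"ok: round trip {total:.3f}s, server time {server_time:.3f}s"


# Hypothetical values, not taken from a real capture.
print(diagnose(ProbeTimes(100.0, 100.25, 100.75, 101.0)))
print(diagnose(ProbeTimes(105.0, None, None, None)))
```

If the first case ("probe never reached the server") matches what you see, that would be consistent with IIS logging nothing for those attempts, and it would point at the network path or the server's TCP stack rather than the application.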
If there is significant latency in the monitor responses, and responses often arrive after the timeout has expired, consider tuning the monitor's interval and timeout so that the server has enough time to reply.
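As a small illustration of the timer arithmetic: F5's usual guideline is to set the timeout to three times the interval plus one second, which is where a 5/16 pairing comes from. The Python sketch below applies that rule and counts how many measured response latencies would miss a given timeout. The function names and the latency list are made up for illustration; the latencies would come from your own captures.

```python
from typing import Sequence


def recommended_timeout(interval: int) -> int:
    """F5's commonly cited guideline: timeout = (3 * interval) + 1 seconds."""
    return 3 * interval + 1


def late_fraction(latencies: Sequence[float], timeout: float) -> float:
    """Fraction of responses that arrived after the timeout expired."""
    if not latencies:
        return 0.0
    late = sum(1 for latency in latencies if latency > timeout)
    return late / len(latencies)


print(recommended_timeout(5))    # interval 5  -> timeout 16
print(recommended_timeout(10))   # interval 10 -> timeout 31

# Hypothetical response latencies in seconds, measured from a capture.
observed = [1.0, 2.0, 20.0, 17.0]
print(late_fraction(observed, timeout=16))
```

If no response comes back at all, as in the capture described above, longer timers only delay the down event, so it is worth confirming where the probes are lost before changing them.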
You could also issue probes manually with distinctive settings so that they are easier to pick out of the pcap file.
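One way to do that is sketched below in Python: send a GET with a distinctive User-Agent string so the request stands out in both the capture and the IIS log. The host name, path, IP address, and marker string are placeholders, not values from this thread, and a probe sent from another host on the VLAN does not exercise exactly the same path as the LTM's own monitor, so treat it as a complement to the captures rather than a replacement.

```python
import socket


def build_probe_request(host: str, path: str = "/",
                        marker: str = "manual-probe-001") -> bytes:
    """Build a plain HTTP/1.1 GET whose User-Agent carries a unique marker."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        f"User-Agent: {marker}",   # easy to search for in a pcap or IIS log
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def send_probe(ip: str, port: int, request: bytes,
               timeout: float = 5.0) -> bytes:
    """Send the request and return the raw response bytes.

    Raises OSError (including socket.timeout) if the connection fails.
    """
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


if __name__ == "__main__":
    # Placeholder values; replace with the pool member under test.
    req = build_probe_request("app.example.test", "/", "manual-probe-001")
    print(req.decode("ascii"))
    # response = send_probe("192.0.2.10", 80, req)
```

In Wireshark, a display filter such as `http.user_agent contains "manual-probe-001"` should then isolate these requests.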