Forum Discussion
wtwagon_99154
Nimbostratus
Feb 10, 2010
Health Check Issues
The Scenario:
We are performing some load testing of an application and are using a health check as follows:
GET /blah/blah.asmx HTTP/1.1\r\nUser-Agent:Mozilla\r\nhost:blahblah.blah.net\r\n
We look for a 200 OK
Monitor interval: 5 seconds
Timeout: 16 seconds
We have 2 servers in the pool, and both are green at the moment.
The Problem:
As soon as we start throwing LoadRunner load tests at this VIP, all the servers go red almost immediately (we are throwing a small amount of load, around 40 concurrent sessions). What really gets me is that I can run the exact same request as above during the load test and successfully receive a 200 OK.
I also tried changing the interval to 15 seconds with a timeout of 46 seconds and saw the same problem (although this time it happened after 5 minutes instead of almost immediately).
What steps can be taken to try to diagnose why the F5 is thinking these nodes are down, when they are clearly not?
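One way to cross-check what the monitor sees is to send a similar request from a separate host during the load test and time it. The Python below is only a minimal sketch, not a replica of the BIG-IP monitor: the function names are made up for illustration, the `Connection: close` header and the blank line that ends the headers are added assumptions, and the health check is a plain substring match on `200 OK`.

```python
import socket
import time


def build_probe(path, host, user_agent="Mozilla"):
    """Build a request similar to the monitor send string (illustrative only)."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"User-Agent: {user_agent}",
        f"Host: {host}",
        "Connection: close",  # assumption: ask the server to close so the read loop ends
        "",                   # the two empty items produce the blank line
        "",                   # that terminates the header section
    ]
    return "\r\n".join(lines).encode("ascii")


def looks_healthy(response, receive_string=b"200 OK"):
    """Return True if the receive string appears anywhere in the raw response."""
    return receive_string in response


def probe(ip, port, request, timeout=5.0):
    """Send the request to one pool member; return (raw_response, elapsed_seconds)."""
    start = time.monotonic()
    chunks = []
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        sock.sendall(request)
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks), time.monotonic() - start
```

Calling `probe()` against each pool member in a loop while the load test runs, with the request from `build_probe("/blah/blah.asmx", "blahblah.blah.net")`, and logging `looks_healthy()` together with the elapsed time would show whether responses slow down or fail at the moments the members are marked down.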
18 Replies
- hoolio
Cirrostratus
Hi,
Do you see any errors in the server access logs when the monitors fail? Do you have connection limits set for the pool members or node addresses? Which LTM version are you running?
I added some monitor troubleshooting tips to this page:
http://devcentral.f5.com/wiki/default.aspx/AdvDesignConfig/TroubleshootingLtmMonitors.html
If you see anything that could be added to the page, either add it or reply here with comments.
If the issue only occurs during a (light) load test, it will be fun to pick through the logs and/or tcpdumps to find the problem, but it should still be possible to troubleshoot this.
Thanks,
Aaron
- wtwagon_99154
Nimbostratus
In reply to hoolio's post of 02/10/2010 8:35 AM:
Thanks for the quick reply.
Running version 10.0.0 on build 5514 (HF2) on a BIGIP 1600 LTM.
No connection limit set on the VIP level, pool level, or node level.
I'll take a look at the link you provided as well just to see if I can pick anything out. Logs aren't very conclusive and just show things like:
Wed Feb 10 10:42:47 EST 2010 local/wayne-lc2 notice mcpd[2029] 01070727 Pool member 192.168.201.15:80 monitor status up.
Wed Feb 10 10:42:54 EST 2010 local/wayne-lc2 notice mcpd[2029] 01070638 Pool member 192.168.201.13:80 monitor status down.
Wed Feb 10 10:42:54 EST 2010 local/tmm err tmm[984] 01010028 No members available for pool tilap.80
- hoolio
Cirrostratus
Does the peer unit also mark the pool members down at the same time? Do you see anything interesting in the web server logs?
Aaron
- wtwagon_99154
Nimbostratus
What's strange is that the peer unit has not marked the servers down during that same time.
IIS logs are coming back with 200s for the health checks. Interesting.
- wtwagon_99154
Nimbostratus
Just wanted to see if anyone else had some thoughts about monitor interval and timeout.
What is the typical rule of thumb for monitoring intervals and timeouts? Should the timeout be Interval x 3 + 1? Is 5 seconds with a 16 second timeout too much?
- hoolio
Cirrostratus
Setting timeout = 3 x interval + 1 is a best practice, as it allows the server three chances to respond successfully before being marked down. A 5 second interval is a good place to start, as a test request this often shouldn't overload the servers. And 16 seconds is not a horribly long time to wait to mark a failed server down.
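The arithmetic behind that rule of thumb can be sketched in a few lines of Python. The function name and the `attempts` parameter are made up for illustration and are not BIG-IP settings.

```python
def monitor_timeout(interval, attempts=3):
    """Timeout (seconds) that allows `attempts` probes at `interval` seconds, plus 1."""
    if interval <= 0 or attempts <= 0:
        raise ValueError("interval and attempts must be positive")
    return attempts * interval + 1


# The two interval settings tried earlier in this thread.
for interval in (5, 15):
    print(interval, monitor_timeout(interval))
# prints:
# 5 16
# 15 46
```

Both pairs used in the original post (5/16 and 15/46) follow this rule.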
If 16 seconds is too long for you to wait to mark a dead member down, you could reduce the timeout or use an iRule to mark the pool member down (LB::down in the LB_FAILED event). If your servers are being overwhelmed with too many requests, you could extend the interval and timeout longer than 5 / 16.
I'd still try troubleshooting why it's failing with the default of 5/16 during a light load test before tinkering too much with the monitor timings.
Aaron
- wtwagon_99154
Nimbostratus
Also, what do you think about inband monitoring? We obviously aren't doing much application-level monitoring here (we use another system for that type of monitoring). Inband monitoring combined with active monitoring looks pretty cool. I just wanted to see if someone has experience with it and hear their thoughts on its performance.
- hoolio
Cirrostratus
The main purpose of a monitor is to ensure that any pool member LTM sends traffic to can handle the traffic. So it's ideal if you can configure the monitor to check a page which hits the database and checks the full state of the application. Inband monitoring is useful for ensuring the network layer is working. As you suggest, the two in combination work well.
Aaron
- wtwagon_99154
Nimbostratus
Got a nice little update..
Did some testing of Passive + Active vs. Active only, with success. I was able to get the F5 to keep the servers in the pool during the load test using Passive + Active. When I stopped IIS on a server, it did its thing and pulled the server out, and then put it back in, etc.
I then ran some debugging with tcpdump like your article said and discovered something alarming. The F5 would be going along fine, polling every 5 seconds, and then it would just stop polling for a few minutes before starting up again. While polling was stopped, both servers would be marked down. I'm not sure what the deal is there, but I plan to open a ticket with support to diagnose this issue. Smells like a bug to me (maybe).
- hoolio
Cirrostratus
That's interesting. Does top show the monitoring daemon, bigd, using a lot of CPU? Does CPU0 show high usage from other processes?
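To line up the polling pauses with CPU spikes, it may help to list exactly when the monitor went quiet. The Python below is a rough sketch that assumes you already have the send times of the monitor requests as epoch seconds (for example, taken from a capture printed with `tcpdump -tt`). The function name and the `slack` factor are made up for illustration.

```python
def find_polling_gaps(timestamps, interval=5.0, slack=2.0):
    """Return (start, end, seconds) for every pause longer than interval * slack."""
    ordered = sorted(timestamps)
    gaps = []
    # Compare each probe time with the one that follows it.
    for prev, cur in zip(ordered, ordered[1:]):
        delta = cur - prev
        if delta > interval * slack:
            gaps.append((prev, cur, delta))
    return gaps


# Probes every 5 seconds, then a pause of about three minutes.
print(find_polling_gaps([0, 5, 10, 15, 200, 205]))
# prints: [(15, 200, 185)]
```

Comparing the start and end of each gap with the `monitor status down` entries in the LTM log would show whether the members are marked down because no probes were sent at all.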
Let us know what you find out from the support case.
Thanks,
Aaron