Forum Discussion
jeff_mccombs_47
Nimbostratus
May 06, 2012
how have others conquered external monitor resource alloc problems?
I've got a pair of 6900's that are monitoring about 600 backend servers. The servers are polled via an external monitor script that issues a few SNMP gets to check for various overload and down conditions.
This is not scaling well for me. The sheer number of monitors that are running is beginning to consume a significant amount of processor time. I anticipate the backend pool of servers growing to 800-1000 nodes before this is all said and done.
How have others managed large pools of servers with custom monitors?
- LTM 6900/10.2.3
6 Replies
- hwidjaja_37598
Altostratus
How complex is your monitor script?
IMO, an external monitor script is quite expensive because of the process forking. Try to reduce the number of commands in the script.
As an alternative, offload the monitors to another system (e.g. a web-based monitor check), so that the F5 uses HTTP to check the condition of all back-end servers. That way the F5 doesn't need to fork a new process for each SNMP get.
- jeff_mccombs_47
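As a rough illustration of cutting down the forks, the sketch below fetches both values with a single snmpget call and does the parsing and comparisons with shell builtins instead of sed and awk. It assumes a net-snmp snmpget that accepts several OIDs per call and the `-Oqv` output option; the function names `decide_status` and `poll_node`, and the `MAX_OID` / `THROTTLE_OID` variables, are made up for this example, and the OID placeholders are left exactly as elided in the thread. The up/down rules copy the order of checks in the script posted later in this thread.

```bash
#!/bin/bash
# Sketch only: one snmpget per check, no sed/awk. Names are hypothetical.

MAX_OID="my.really.long.oid"        # placeholder, elided in the thread
THROTTLE_OID="my.really.long.oid"   # placeholder, elided in the thread

# Print "up" when the node should be marked up, nothing otherwise.
# Args: max_value throttle_value max_sessions
decide_status() {
    local max=$1 throttle=$2 limit=$3 down=0
    if [ "$throttle" = "2" ]; then down=1; fi
    if [ -z "$max" ]; then down=1; fi
    # Same order as the posted script: being under the limit clears DOWN.
    if [ -n "$max" ] && [ "$max" -lt "$limit" ]; then down=0; fi
    if [ "$down" -eq 0 ]; then echo "up"; fi
    return 0
}

# Args: ip community max_sessions
poll_node() {
    local ip=$1 community=$2 limit=$3 max throttle
    # -Oqv should print bare values, one per line; the exact format can
    # vary by data type, so check it against your agent first.
    { read -r max; read -r throttle; } < <(
        snmpget -v2c -c "$community" -Oqv "$ip" "$MAX_OID" "$THROTTLE_OID" 2>/dev/null
    )
    decide_status "$max" "$throttle" "$limit"
}

# Only poll when called with the monitor's arguments (ip port community max).
if [ "$#" -ge 4 ]; then
    # ${1/:FFFF:/} strips the prefix without forking echo and sed.
    poll_node "${1/:FFFF:/}" "$3" "$4"
fi
```

This still forks once for snmpget, so it reduces the per-check cost rather than removing it.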
Nimbostratus
The script isn't too complicated, but I've got about 600 of them running.
Offloading the checks onto a secondary system and polling a results page would work pretty well. It's still an external check via curl/wget/whatever, though I suppose the number of commands spawned would be reduced to just one.
Here's one of the monitors. This one currently runs on only about 200 nodes. There's another that is very similar (more snmpgets) that runs on about 400.

    #!/bin/bash
    DESTIP=`echo $1 | sed 's/:FFFF://'`
    COMMUNITY=$3
    MAXSESSIONS=$4
    pidfile="/var/run/monitor_nodes.$1..$2.pid"
    # kill any previous instance of this check that is still running
    if [ -f $pidfile ] ; then
        kill -9 `cat $pidfile` > /dev/null 2>&1
    fi
    echo "$$" > $pidfile
    MAX=`nice snmpget -v2c -c $COMMUNITY $DESTIP my.really.long.oid | awk '{print $4}'`
    SESSION_THROTTLE=`nice snmpget -v2c -c $COMMUNITY $DESTIP my.really.long.oid | awk '{print $4}'`
    DOWN=0
    if [ $SESSION_THROTTLE -eq 2 ] ; then
        DOWN=1
    fi
    if [ "x$MAX" == "x" ] ; then
        DOWN=1
    fi
    if [ $MAX -lt $MAXSESSIONS ] ; then
        DOWN=0
    fi
    # any output on stdout marks the node up
    if [ $DOWN -eq 0 ] ; then
        echo "up"
    fi
    rm -f $pidfile

the TCP Port check is probably the least expensive from the F5's point of view, correct? Wonder if I could get creative with a Tcl or Perl script that runs as a monitor on each node. Have the script open or close a TCP port on the host based on availability. As long as the port is open, the service is alive. If the port shuts, the service is in an overload condition or not running.. seems like a lot of connections, maybe UDP instead.
sounds like a lot of work.
- ib_37889
Nimbostratus
Jeff
I have used external monitors that execute on the back-end node via inetd/tcpserver (assuming the backends are un*x-like) with success, using telnet on the BIG-IP. However, should the backend have an HTTP server, I think it would be easier to use an HTTP monitor and have a script on the webserver do the same checks the shell script does. You can then check for "up" in the reply.
Right now every 6900 is forking 1 shell with at least 4 sub-shells (back-quotes) per back-end.
With 600 BEs, that makes 3000 instances of bash every x seconds.
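A minimal sketch of what such a script on the webserver might look like, written as a bash CGI. Everything here is an assumption for illustration: the function names, the `APP_SESSIONS` / `APP_MAX_SESSIONS` / `APP_THROTTLED` variables, and the simplified rule (up only when not throttled and under the session limit). A real script would get those values from the application itself.

```bash
#!/bin/bash
# Hypothetical CGI health check served by the backend's own web server.

# Print the response body: "up" or "down".
# Args: current_sessions max_sessions throttled(yes|no)
health_body() {
    local sessions=$1 limit=$2 throttled=$3
    if [ "$throttled" != "yes" ] && [ "$sessions" -lt "$limit" ]; then
        echo "up"
    else
        echo "down"
    fi
}

# Print a complete CGI response: header, blank line, body.
cgi_response() {
    printf 'Content-Type: text/plain\r\n\r\n'
    health_body "$@"
}

# Web servers set GATEWAY_INTERFACE when running a script as CGI.
if [ -n "${GATEWAY_INTERFACE:-}" ]; then
    # Placeholder values; a real script would ask the application here.
    cgi_response "${APP_SESSIONS:-0}" "${APP_MAX_SESSIONS:-100}" "${APP_THROTTLED:-no}"
fi
```

The HTTP monitor would then request this URL and match on "up" in the reply, so the load balancer itself forks nothing per check.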
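The arithmetic behind that figure, extended to the 800-1000 nodes mentioned in the original post, can be sketched as follows. The function name `bash_instances` is made up for this example, and the count covers only the shell and its back-quote sub-shells, not the sed, awk, or snmpget processes they start.

```bash
#!/bin/bash
# 1 monitor shell + 4 back-quote sub-shells per backend, per interval.
bash_instances() {
    local nodes=$1
    echo $(( nodes * (1 + 4) ))
}

for nodes in 600 800 1000; do
    echo "$nodes backends: $(bash_instances "$nodes") bash instances per interval"
done
# 600 backends: 3000 bash instances per interval
# 800 backends: 4000 bash instances per interval
# 1000 backends: 5000 bash instances per interval
```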
Seeing the part with the pidfile and the kill, I assume you have noticed the scripts overlap or hang.
kr,
Ib
- smp_86112
Cirrostratus
Hey I thought I'd chime in here because I did some work with external monitors over the past couple of days. It seems that the LTM will not execute any commands after receiving anything in STDOUT (i.e. after the "echo" command). So when the check is successful, the "rm -f $pidfile" command in your script does not get executed. I was troubleshooting the fact that my PID file was not getting cleaned up when I expected it to (my script was similar to yours - echo, then rm). But then I noticed the last comment in this Tech Tip:
https://devcentral.f5.com/Tutorials/TechTips/tabid/63/articleType/ArticleView/articleId/151/LTM-External-Monitors-The-Basics.aspx
FYI: When the external monitor script outputs "UP" than the deletion of the PIDFILE never occurs, neither does any further commands after the output of "UP"
It does however delete the PID if the monitor fails. I can only assume the F5 kills the script immediately once any output is detected...
After I read that I did some more testing, and sure enough, that's what I found too despite not being able to find any documentation on it. Not sure that's directly related to your scalability issue, but thought you might want to be aware of it as you develop your automation.
- hoolio
Cirrostratus
I concur. I found that the script is killed after any STDOUT is sent:
https://devcentral.f5.com/wiki/AdvDesignConfig.TemplateForExternalLtmMonitors.ashx
Note that any standard output will result in the script execution being stopped
So do any cleanup before echoing to STDOUT
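A small sketch of that ordering, assuming the behaviour described in this thread (the script is stopped as soon as anything reaches STDOUT). The function name `report_status` is made up for this example; the point is only that the PID file is removed before the echo, which becomes the last action.

```bash
#!/bin/bash
# Remove the PID file first, then report. Nothing may follow the echo,
# since the script is reportedly killed once output is seen.
# Args: pidfile down_flag(0 = up, 1 = down)
report_status() {
    local pidfile=$1 down=$2
    rm -f "$pidfile"
    if [ "$down" -eq 0 ]; then
        echo "up"
    fi
}
```

In a monitor script this would be the final call, for example `report_status "$pidfile" "$DOWN"`.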
Aaron
- jeff_mccombs_47
Nimbostratus
Thanks for the replies guys.
I had seen the scripts overlap. The SNMP polls are expensive, as the BE nodes are running a custom app that provides the SNMP interface and it's a little slow. On top of that, the line card the 6900s are on is a wee bit overloaded at the moment and pretty much everything is slowing down because of it.
And thanks Ib - I was going to look into the pid file thing once I caught my breath. You saved me a ton of legwork there.
I dunno why I didn't think of the HTTP check method. I've done that in the past too. Must be because I'm not getting enough sleep. I must have been a real horrible person in a past life. :)
Appreciate all the feedback guys!