Forum Discussion

jeff_mccombs_47
May 06, 2012

How have others conquered external monitor resource allocation problems?

I've got a pair of 6900's that are monitoring about 600 backend servers. The servers are polled via an external monitor script that issues a few SNMP gets to check for various overload and down conditions.

 

 

This is not scaling well for me. The sheer number of monitors that are running is beginning to consume a significant amount of processor time. I anticipate the backend pool of servers growing to 800-1000 nodes before this is all said and done.

 

 

 

How have others managed large pools of servers with custom monitors?

 

 

 

- LTM 6900/10.2.3

 

 

 

  • How complex is your monitor script?

     

     

    IMO, an external monitor script is quite expensive because every run forks new processes. Try to reduce the number of commands in the script.

     

     

    As an alternative, offload the checks to another system (e.g. a web-based monitor check), so that the F5 uses HTTP to check the condition of all back-end servers. That way the F5 doesn't need to fork a new process for each SNMP get.
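A rough sketch of the offloading idea, assuming a separate polling host that runs the SNMP checks on a schedule and publishes one small status file per node under a web server's document root. The function names, the file layout, the `COMMUNITY` default and the throttle rule are all hypothetical; the OID is the same placeholder used elsewhere in this thread.

```shell
#!/bin/bash
# Hypothetical poller for a separate monitoring host (not the BIG-IP).
# It writes "<out_dir>/<node>.txt" containing "up" or "down" for every node.

# Stand-in for the real SNMP checks; returns 0 when the node looks healthy.
# The community default and the "2 means throttled" rule are assumptions.
check_node() {
  local throttle
  throttle=$(snmpget -v2c -c "${COMMUNITY:-public}" "$1" my.really.long.oid 2>/dev/null | awk '{print $4}')
  [ -n "$throttle" ] && [ "$throttle" != "2" ]
}

# Read node addresses (one per line) and publish one status file per node.
write_status_pages() {
  local nodes_file="$1" out_dir="$2" node tmp
  while read -r node; do
    [ -z "$node" ] && continue
    tmp="$out_dir/.$node.tmp"
    if check_node "$node"; then
      echo "up" > "$tmp"
    else
      echo "down" > "$tmp"
    fi
    # Rename into place so a reader never sees a half-written file.
    mv "$tmp" "$out_dir/$node.txt"
  done < "$nodes_file"
}
```

With something like this in place, an ordinary HTTP monitor could request the node's status file and look for "up" in the response, so the load balancer itself forks nothing per check. How fresh the result is then depends on how often the poller runs.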
  • The script isn't too complicated, but I've got about 600 of them running.

    Offloading the checks onto a secondary system and polling a results page would work pretty well. It's still an external check via curl/wget/whatever, though I suppose the number of commands spawned would be reduced to just one.

    Here's one of the monitors. This one only runs on about 200 nodes currently. There's another that is very similar (more snmpgets) that runs on about 400.

    
    #!/bin/bash
    # Arguments: $1 = node IP, $2 = port, $3 = SNMP community, $4 = session limit
    DESTIP=`echo $1 | sed 's/:FFFF://'`
    COMMUNITY=$3
    MAXSESSIONS=$4
    pidfile="/var/run/monitor_nodes.$1..$2.pid"
    # Kill a previous run of this monitor if it is still around
    if [ -f "$pidfile" ] ; then
      kill -9 `cat "$pidfile"` > /dev/null 2>&1
    fi

    echo "$$" > "$pidfile"
    MAX=`nice snmpget -v2c -c "$COMMUNITY" "$DESTIP" my.really.long.oid | awk '{print $4}'`
    SESSION_THROTTLE=`nice snmpget -v2c -c "$COMMUNITY" "$DESTIP" my.really.long.oid | awk '{print $4}'`

    DOWN=0
    # A throttle value of 2 marks the node down
    if [ "$SESSION_THROTTLE" = "2" ] ; then
      DOWN=1
    fi

    # No answer marks the node down; a value under the limit marks it up again
    if [ "x$MAX" == "x" ] ; then
      DOWN=1
    elif [ "$MAX" -lt "$MAXSESSIONS" ] ; then
      DOWN=0
    fi

    if [ $DOWN -eq 0 ] ; then
      echo "up"
    fi

    rm -f "$pidfile"
     

    The TCP port check is probably the least expensive from the F5's point of view, correct? I wonder if I could get creative with a Tcl or Perl script that runs as a monitor on each node and opens or closes a TCP port on the host based on availability. As long as the port is open, the service is alive. If the port shuts, the service is in an overload condition or not running. That seems like a lot of connections, though; maybe UDP instead.

    Sounds like a lot of work.
  • Jeff,

     

     

    I have used external monitors that execute on the back-end node via inetd/tcpserver (assuming the back-ends are Unix-like) with success, using telnet on the BIG-IP. However, should the back-end have an HTTP server, I think it would be easier to use an HTTP monitor and have a script on the web server do the same checks the shell script does. You can then check for "up" in the reply.

     

     

    Right now every 6900 is forking 1 shell with at least 4 sub-shells (back-quotes) per back-end.

    With 600 back-ends, that makes 5 × 600 = 3000 instances of bash every x seconds.

     

    Seeing the part with the pidfile and the kill, I assume you have noticed the scripts overlap or hang.

     

     

    kr,

     

    Ib
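One way to shrink the fork count worked out above, sketched under assumptions: the function and variable names are hypothetical, the OIDs are placeholders, and the address prefix is assumed to sit at the start of `$1`. The sketch swaps the `echo | sed` pipeline for a shell parameter expansion and asks for both OIDs in a single `snmpget` call, which accepts several OIDs per request.

```shell
#!/bin/bash
# Sketch only: same data as the monitor script, gathered with fewer processes.

# Strip the address prefix without forking: ${1#pattern} removes a leading match.
strip_prefix() {
  ADDR=${1#:FFFF:}
}

# One snmpget (and one awk) for both OIDs instead of one pipeline per OID.
# Arguments: community, host, first OID, second OID.
query_both() {
  nice snmpget -v2c -c "$1" "$2" "$3" "$4" 2>/dev/null | awk '{print $4}'
}

# Split the two output lines into MAX and SESSION_THROTTLE.
# Both stay empty when the query returns nothing.
fetch_values() {
  set -- $(query_both "$@")
  MAX=$1
  SESSION_THROTTLE=$2
}
```

This leaves one command substitution, one `snmpget` and one `awk` per run, in place of two of each plus the `sed` pipeline. The word splitting assumes each value is a single token, so it would need more care if an OID could return a string containing spaces.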

     

     

     

     

     

     

     

     

     

  • Hey, I thought I'd chime in here because I did some work with external monitors over the past couple of days. It seems that the LTM will not execute any commands after receiving anything on STDOUT (i.e. after the "echo" command). So when the check is successful, the "rm -f $pidfile" command in your script does not get executed. I was troubleshooting why my PID file was not getting cleaned up when I expected it to be (my script was similar to yours: echo, then rm). But then I noticed the last comment in this Tech Tip:

     

    https://devcentral.f5.com/Tutorials/TechTips/tabid/63/articleType/ArticleView/articleId/151/LTM-External-Monitors-The-Basics.aspx

     

     

     

    FYI: When the external monitor script outputs "UP" than the deletion of the PIDFILE never occurs, neither does any further commands after the output of "UP"

     

     

    It does however delete the PID if the monitor fails. I can only assume the F5 kills the script immediately once any output is detected...

     

     

     

    After I read that I did some more testing, and sure enough, that's what I found too despite not being able to find any documentation on it. Not sure that's directly related to your scalability issue, but thought you might want to be aware of it as you develop your automation.
  • I concur. I found that the script is killed after any STDOUT is sent:

     

     

    https://devcentral.f5.com/wiki/AdvDesignConfig.TemplateForExternalLtmMonitors.ashx

     

    Note that any standard output will result in the script execution being stopped

     

    So do any cleanup before echoing to STDOUT.

     

     

    Aaron
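A minimal sketch of that reordering, reusing the rules from the script posted earlier in the thread. The helper names and the `PIDDIR` override are hypothetical, and the OIDs remain placeholders. The point is that the verdict is computed first, the PID file is removed next, and the write to STDOUT is the very last step.

```shell
#!/bin/bash
# Sketch only. Arguments as in the original script:
# $1 = node IP, $2 = port, $3 = SNMP community, $4 = session limit.

# Fetch a single SNMP value (same parsing as the original script).
snmp_value() {
  nice snmpget -v2c -c "$1" "$2" "$3" 2>/dev/null | awk '{print $4}'
}

# Apply the original rules; prints "up" or nothing at all.
decide() {
  local max="$1" throttle="$2" limit="$3" down=0
  if [ "$throttle" = "2" ]; then down=1; fi
  if [ -z "$max" ]; then
    down=1
  elif [ "$max" -lt "$limit" ]; then
    down=0
  fi
  if [ "$down" -eq 0 ]; then echo "up"; fi
}

run_monitor() {
  local ip pidfile max throttle verdict
  ip=$(echo "$1" | sed 's/:FFFF://')
  pidfile="${PIDDIR:-/var/run}/monitor_nodes.$1..$2.pid"
  # Kill a previous run that is still hanging around.
  if [ -f "$pidfile" ]; then
    kill -9 "$(cat "$pidfile")" >/dev/null 2>&1
  fi
  echo "$$" > "$pidfile"
  max=$(snmp_value "$3" "$ip" my.really.long.oid)
  throttle=$(snmp_value "$3" "$ip" my.really.long.oid)
  # Capture the verdict instead of printing it straight away.
  verdict=$(decide "$max" "$throttle" "$4")
  rm -f "$pidfile"                                  # cleanup first
  if [ -n "$verdict" ]; then echo "$verdict"; fi    # STDOUT last
}

# Only run when called with the monitor's arguments.
if [ "$#" -ge 4 ]; then
  run_monitor "$@"
fi
```

Capturing the verdict in a variable keeps the decision logic separate from the output, so nothing can reach STDOUT before the PID file is gone.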
  • Thanks for the replies guys.

     

     

    I had seen the scripts overlap. The SNMP polls are expensive because the BE nodes run a custom app that provides the SNMP interface, and it's a little slow. On top of that, the line card the 6900's are on is a wee bit overloaded at the moment, and pretty much everything is slowing down because of it.

     

     

    And thanks Ib - I was going to look into the pid file thing once I caught my breath. You saved me a ton of legwork there.

     

     

    I dunno why I didn't think of the HTTP check method. I've done that in the past too. Must be because I'm not getting enough sleep. I must have been a real horrible person in a past life. :)

     

     

    Appreciate all the feedback guys!