The script isn't too complicated, but I've got about 600 of them running.
Offloading the checks onto a secondary system and polling a results page would work pretty well, but it's still an external check via curl/wget/whatever, though I suppose the number of commands spawned would be reduced to just one.
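If I went that route, the F5-side half shrinks to something like the sketch below: a secondary box runs all the SNMP polling on its own schedule and publishes per-node status, and the external monitor becomes a single curl. The host name, path, and the literal "up" response body are made-up placeholders, not anything site-specific.

#!/bin/bash
# Sketch of the offloaded F5-side monitor: one curl against a results page
# published by the box that actually runs the SNMP checks.
NODE=`echo $1 | sed 's/::FFFF://'`
STATUS=`curl -s --max-time 3 "http://checkhost.example.com/status/$NODE"`
# Any output on stdout marks the member up, so stay silent otherwise.
if [ "$STATUS" == "up" ] ; then
    echo "up"
fi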
Here's one of the monitors. This one only runs on about 200 nodes currently. There's another that is very similar (more snmpgets) that runs on about 400.
#!/bin/bash
# F5 external monitor. The F5 calls this with the node address as $1 (in
# IPv6-mapped form, hence the ::FFFF: strip) and the node port as $2;
# $3 and $4 are the user-supplied arguments.
DESTIP=`echo $1 | sed 's/::FFFF://'`
COMMUNITY=$3
MAXSESSIONS=$4
pidfile="/var/run/monitor_nodes.$1..$2.pid"
# Kill off any previous instance that's still running, then record our PID.
if [ -f "$pidfile" ] ; then
    kill -9 `cat "$pidfile"` > /dev/null 2>&1
fi
echo "$$" > "$pidfile"
# Poll the two values (OIDs redacted); field 4 of snmpget's output is the
# value itself.
MAX=`nice snmpget -v2c -c $COMMUNITY $DESTIP my.really.long.oid | awk '{print $4}'`
SESSION_THROTTLE=`nice snmpget -v2c -c $COMMUNITY $DESTIP my.really.long.oid | awk '{print $4}'`
DOWN=0
# The device reports it's throttling new sessions.
if [ "$SESSION_THROTTLE" == "2" ] ; then
    DOWN=1
fi
# No SNMP response at all means down; otherwise a session count under the
# threshold marks the node up. (The quoting and the elif keep the numeric
# test from erroring when snmpget returned nothing.)
if [ -z "$MAX" ] ; then
    DOWN=1
elif [ "$MAX" -lt "$MAXSESSIONS" ] ; then
    DOWN=0
fi
# The F5 marks the member up only when the monitor writes to stdout.
if [ $DOWN -eq 0 ] ; then
    echo "up"
fi
rm -f "$pidfile"
The TCP port check is probably the least expensive from the F5's point of view, correct? I wonder if I could get creative with a Tcl or Perl script that runs as a monitor on each node: have the script open or close a TCP port on the host based on availability. As long as the port is open, the service is alive; if the port shuts, the service is in an overload condition or not running. Seems like a lot of connections, though; maybe UDP instead.
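For what it's worth, here's a rough bash sketch of that node-side agent. The port number, the poll interval, and the health_ok test are all placeholders; ncat (from the nmap package) is used because classic nc exits after a single connection.

#!/bin/bash
# Keep a TCP port open while the service is healthy so the F5 only needs
# its cheap built-in TCP monitor against this port.
PORT=9999

health_ok() {
    # Hypothetical local test -- swap in whatever defines "available"
    # (process running, local snmpget under the session cap, etc.).
    pgrep -x myservice > /dev/null
}

while true ; do
    if health_ok ; then
        # Start a persistent listener if one isn't already running.
        pgrep -f "ncat -lk $PORT" > /dev/null || ncat -lk $PORT < /dev/null > /dev/null &
    else
        # Unhealthy: close the port so the F5's TCP check fails.
        pkill -f "ncat -lk $PORT"
    fi
    sleep 5
done

On the UDP idea, keep in mind that a UDP "port open" result is usually inferred from the absence of an ICMP port-unreachable reply, so it tends to be less definitive than a completed TCP handshake.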
Sounds like a lot of work.