LTM External Monitors: The Basics

LTM's external monitors are incredibly flexible, fairly easy to implement, and especially useful for monitoring applications for which there is no built-in monitor template. They give you the ability to effectively monitor the health of just about any application by writing custom scripts to interact with your servers in the same way users would.

In this article, I will attempt to explain the basic LTM external monitoring paradigm, then dissect and explain one of the sample monitors from the Advanced Design & Config codeshare.

(Thanks to poster pgroven for inspiring me to finally write this up.)

An "External Monitor" is a script, "external" to the configuration file, that contains specific logic designed to interact with your servers to verify the health of load balanced services. LTM runs a unique instance of the custom-crafted script against each pool member to which it is applied, passing command line arguments and environment variables as specified in the monitor definition calling the script. The script logic formulates and submits a request (or requests) to the target pool member, evaluates the response(s), and manages the pool member's availability based on the results of the response evaluation.

The Tools

The sample monitor scripts
The external script itself should be a shell script (if at all possible) to minimize overhead. If absolutely necessary, a perl script may be used instead, but keep in mind that the overhead of invoking the interpreter and required modules for multiple instances may negatively impact overall performance. Also, LTM was not intended to be a development platform or a dedicated monitoring device, so the software build includes only a limited set of development tools and modules, and you may not find the perl modules you need. You can add them, but doing so is neither recommended nor supported, and those customizations will likely not survive an upgrade. (You can also use an external monitor to invoke a compiled program, but that discussion is beyond the scope of this article.)

cURL
cURL is a very flexible command line tool you can use in shell and perl scripts for complex interactions with HTTP and FTP servers.

netcat
netcat is another useful command line tool that facilitates interaction with TCP and UDP services.

The LTM external monitor template
The LTM external monitor template allows you to specify the name of the script to run, the interval & timeout, the command line arguments and variables the script requires, and an alternate destination for the monitor traffic.

The Tips
("good to know" stuff and best practices recommendations)

There are a few special considerations you need to make when writing the script and configuring the LTM monitor definition that calls it.

Do you really need an external monitor?
Never use an external monitor when a built-in one will work as well. Forking a shell and running even the simplest shell script takes a significant amount of system resources, so external monitors should be avoided whenever possible. If possible, have the server administrator script execution of the required transaction on the server itself (or locate/author an alternative script on the server) that reliably reflects its availability. Then, instead of an external monitor, you can define a built-in monitor that requests that dynamic script from the server, and let the server run the script locally and report results. For example, the simple request/response HTTP transaction in the sample script below would be much better implemented using the built in basic HTTP monitor.

Optimization
Use the lowest-overhead tools, make the simplest possible request, & minimize the amount of response parsing required to determine the pool member's status. The script can contain just about any logic you want to determine whether that server is healthy. You can use command line tools like netcat and cURL to replicate server transactions, from a basic request and response parsing for an expected string, to more complicated exchanges where cookies or persistence tokens are used, login is required, or some other dynamic transaction must take place in order to establish the usability of a server by its intended users.

Redundant pairs
Both units in a redundant pair will independently run the configured monitor, even when running as Standby. Monitor status is not shared between the units of a redundant pair.

Variables
Variables may be passed to indicate service or hostname, the URI you need to request, or just about any piece of information that would be needed to construct a valid query and receive a valid response from the server. Variables can contain static values, basic regex expressions, or even expressions that contain other variables. As long as your script receives the expected variables from the monitor definition and the logic handles them appropriately, the possibilities are fairly limitless.

Authentication
If your script must pass authentication tokens to the pool members to sufficiently transact with them, make sure the authentication method will allow multiple concurrent logins. Each pool-member-specific instance on each member of a redundant pair may attempt to log in simultaneously. If only a single login is allowed per credential, authentication collisions will most likely result in multiple rolling false downs, since only one monitor request can succeed at a time.

Script against one pool member
The script should be written to determine the health of one specific pool member. An LTM monitor script is really a template for monitoring a single pool member. Whether you apply an LTM monitor to an individual pool member or to the entire pool in the GUI, a separate copy of the monitor runs for each pool member, passing only that specific IP & port to be tested and maintaining only that single tested pool member's availability.  (Discrete monitoring of a single pool member by an external monitor is especially important if other monitors will also be applied to the pool members.)

Minimize the work
Keep the amount of work your monitor script must perform as small as possible. Both the script that runs on LTM and the request against the server itself should represent the minimum interaction required to adequately determine the server's health. If you consider how often the monitor will make that request against each pool member, you can get an idea of the scale of the work that you're asking both BIG-IP and your servers to do.

The Ins & Outs

A script intended for use as an external monitor must conform to some specific input and output requirements.

Command line arguments
IP and port of the pool member are passed automatically as the first 2 command line arguments for all external monitors. The IP address is always passed in the IPv6 format (TMOS' internal address format). IPv4 addresses are passed using IPv6's special "IPv4 mapped address" transition notation: The IPv4 address prefixed with "::ffff:". In that notation, the IP address for pool member 10.0.0.1:80 would be "::ffff:10.0.0.1". The proper address type is critical to proper operation of your monitor script. More on that later.

Additional command line arguments may be defined in the monitor configuration. When defined, they are passed to the script by the monitoring daemon as the 3rd, 4th, and subsequent arguments.

Variables
Variables in the form of Name/Value pairs may also be defined in the monitor configuration. When defined, they are created as environment variables in the shell forked for each instance of the script.
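
As a rough illustration of those inputs, here is a minimal sketch of how a script might collect them. The helper name read_monitor_inputs, the EXTRA_ARGS variable, and the fallback values for URI and RECV are assumptions made for this example; only the positions of the IP and port arguments and the environment-variable mechanism come from the description above.

```shell
#!/bin/sh
# Sketch only: read_monitor_inputs is a hypothetical helper, not F5 tooling.
read_monitor_inputs() {
    # $1 and $2 are always the pool member address and port.
    MEMBER_ADDR="$1"
    MEMBER_PORT="$2"
    shift 2
    # Anything left over was defined as extra arguments in the monitor.
    EXTRA_ARGS="$*"
    # Name/Value pairs arrive as environment variables. The fallbacks
    # below are illustrative defaults, used only if a pair is not defined.
    URI="${URI:-/}"
    RECV="${RECV:-OK}"
}

# Simulate the arguments the monitoring daemon would pass.
read_monitor_inputs "::ffff:10.0.0.1" 80 --verbose

# Diagnostics go to stderr so that stdout stays empty.
echo "addr=$MEMBER_ADDR port=$MEMBER_PORT extra=$EXTRA_ARGS uri=$URI" >&2
```

Writing the diagnostic line to stderr rather than stdout is deliberate, given the output rules described in the next section.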

Script Output
IF ANY VALUE AT ALL GOES TO STANDARD OUTPUT, THE POOL MEMBER WILL BE MARKED UP.

If the pool member is determined to be healthy enough to receive load balance traffic by successfully satisfying the script logic, the script should output any value but null to standard output, and the monitoring daemon will mark the pool member up.

If the pool member does not respond as expected, the script will output nothing to stdout, and the lack of output will cause the monitoring daemon to mark the pool member down at the expiration of the timeout.

All other outputs from the script are ignored by the monitoring daemon.
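
One way to respect that contract is to discard everything the health check prints and write to stdout in exactly one place. The sketch below assumes a hypothetical helper named report_status; it is an illustration of the rule, not code from the sample monitor.

```shell
#!/bin/sh
# Sketch only: report_status is a hypothetical helper. It runs the given
# check command with all of its output discarded, then prints "UP" only
# if the check exited successfully.
report_status() {
    if "$@" > /dev/null 2>&1; then
        echo "UP"
    fi
    # On failure nothing is printed: silence is what marks the member down.
}

report_status true     # prints "UP"
report_status false    # prints nothing
```

Because the check's own stdout and stderr are both sent to /dev/null, a stray warning or error message from a tool cannot be mistaken for a success signal.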

The Timing

The interval is the amount of time that will elapse between the start of each monitor attempt. In order to avoid creating a Denial of Service situation by sending your servers excessive monitor traffic, you should increase the interval as much as possible. The interval MUST be longer than the longest possible healthy response should take, since each successive instance of the script run against a pool member will kill off any already-running previous instances, assuming they are hung and will never complete.

F5 recommends setting the timeout to three times the interval plus 1 second, but you can use a different ratio if necessary. Setting the timeout shorter than the interval is not recommended.
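
The arithmetic behind that rule of thumb is simple enough to sketch. The function name recommended_timeout and the sample interval values are made up for this example.

```shell
#!/bin/sh
# Sketch only: recommended_timeout is a hypothetical helper implementing
# the (3 x interval) + 1 rule of thumb described above.
recommended_timeout() {
    echo $(( $1 * 3 + 1 ))
}

for interval in 5 10 30; do
    echo "interval=${interval}s timeout=$(recommended_timeout "$interval")s"
done
```

With a 5 second interval this gives a 16 second timeout, so a pool member has three full attempts to respond before it is marked down.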

If you consider that the monitor will make that request every <interval> against each pool member, you can get an idea of the scale of the work that you're asking both LTM and your servers to do, so some careful testing is in order with the goal of minimizing the timeout value and maximizing the interval. (If you notice that your healthy pool members are being marked down and then back up again on the next interval, your timeout may be too short, and some further experimentation may be in order.)

There are also ways that you can control and tighten up the tolerance for timing in some monitors. In another article, we will take a closer look at a different external monitor that marks pool members down a little bit more aggressively than waiting for the monitor timeout.

The Gory Details

Here's a sample monitor from the codeshare: HTTPMonitor_cURL_BasicGET

Let's go through the script a section at a time and take a closer look at what's going on. First of all, notice that the script documentation tells us it is expecting 2 variable definitions:

 # This example expects the following Name/Value pairs:
 # URI = the URI to request from the server
 # RECV = the expected response (not case sensitive)

For this example, we are going to request the URI "/testpath/testfile.html" over HTTP for each server, and expect a string that says "Server is UP!!!". (As noted earlier, the simple request/response HTTP transaction demonstrated here would be much better implemented using the built-in basic HTTP monitor with static request/receive strings, but it is still helpful in demonstrating the basic requirements for external monitor implementation.) Now that we know what variables we need to define, the monitor configuration will look like this:

 monitor ExternalHTTP {
 defaults from external
 RECV "Server is UP!!!"
 run ""
 URI "/testpath/testfile.html"
 } 

Once the monitor is defined, it can be applied to the pool members. (The monitor can be applied to individual pool members or the entire pool. Either way, a unique instance of the script is run for each pool member at each interval to monitor each pool member independently.)

When the monitoring daemon (bigd) runs the script according to the monitor definition, it forks a new shell and creates the required environment variables, then invokes the script with the 2 default command line arguments (the target pool member's IP address and port).
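
To get a feel for that calling convention, you can imitate it by hand from a shell. The sketch below is an approximation under stated assumptions: the stub script stands in for a real monitor, the address and port are sample values, and nothing here reproduces bigd's actual process handling.

```shell
#!/bin/sh
# Sketch only: imitate the calling convention described above.
# The stub stands in for a real monitor script and is hypothetical.
STUB=$(mktemp)
printf '%s\n' \
    '#!/bin/sh' \
    'echo "args: $1 $2 URI=$URI RECV=$RECV" >&2' > "$STUB"

# Name/Value pairs become environment variables; IP and port are $1 and $2.
OUTPUT=$(URI="/testpath/testfile.html" RECV="Server is UP!!!" \
    sh "$STUB" "::ffff:10.0.0.1" 80)

# Any stdout at all means "up"; empty stdout means "down".
if [ -n "$OUTPUT" ]; then
    RESULT="up"
else
    RESULT="down"
fi
echo "member would be marked $RESULT" >&2
rm -f "$STUB"
```

The stub writes only to stderr, so this run reports the member as down. Swapping in a real monitor script lets you see what it would send to stdout before attaching it to a pool.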

At the start of the script, the command line arguments are processed. First it strips the IPv4-mapped prefix from the address if one is present, leaving a standard IPv4 address, and then it assigns both arguments to named variables:

 # remove IPv6/IPv4 compatibility prefix (LTM passes addresses in IPv6 format)
 IP=`echo ${1} | sed 's/::ffff://'`
 PORT=${2} 
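
If some of your pool members have native IPv6 addresses, the host part of the URL needs square brackets. Here is a minimal sketch of one way to handle both cases, assuming a hypothetical helper named url_host. It does not deal with route domain suffixes, which need separate handling.

```shell
#!/bin/sh
# Sketch only: url_host is a hypothetical helper that turns the address
# argument into a host string usable in a URL.
url_host() {
    case "$1" in
        ::ffff:*.*) echo "${1#::ffff:}" ;;  # IPv4-mapped: plain dotted quad
        *:*)        echo "[$1]" ;;          # native IPv6: bracketed for URLs
        *)          echo "$1" ;;            # already a plain IPv4 address
    esac
}

url_host "::ffff:10.0.0.1"   # prints 10.0.0.1
url_host "2001:db8::1"       # prints [2001:db8::1]
```

When a bracketed host is passed to curl, add the -g option so that curl does not treat the brackets as a globbing range.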

Once the IP and PORT variables are defined, they are used to set up a process management scheme intended to prevent multiple copies of the monitor from running against the same pool member at the same time. It works like this: Each instance of the script first looks for a unique file named "monitorname.IP_port.pid" in /var/run containing the process ID of the last instance of the script run against that pool member. If it exists, it means the last instance of the script has not completed. Since multiple copies of the same script running against the same pool member may interfere with proper monitor operations, the script kills that process, then re-writes the PID file containing the process ID of the current instance for reference by the next instance.

 PIDFILE="/var/run/`basename ${0}`.${IP}_${PORT}.pid"
 # kill off the last instance of this monitor if hung and log current pid
 if [ -f $PIDFILE ]
 then
 kill -9 `cat $PIDFILE` > /dev/null 2>&1 
 fi
 echo "$$" > $PIDFILE 
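
A stale PID file can end up pointing at an unrelated process that has since reused the same process ID. The sketch below shows one possible guard, assuming a hypothetical helper named kill_stale_instance: it kills the recorded PID only if that process's command line still mentions the monitor script's name. It relies on the standard ps options -p and -o.

```shell
#!/bin/sh
# Sketch only: kill_stale_instance is a hypothetical helper.
# $1 = PID file path, $2 = script name expected in the command line.
kill_stale_instance() {
    pidfile="$1"
    script_name="$2"
    [ -f "$pidfile" ] || return 0
    old_pid=$(cat "$pidfile")
    case "$old_pid" in
        ''|*[!0-9]*) return 0 ;;   # not a plain number: ignore it
    esac
    # Kill only if the process still looks like an instance of this monitor.
    if ps -p "$old_pid" -o args= 2>/dev/null | grep -qF -- "$script_name"; then
        kill -9 "$old_pid" > /dev/null 2>&1
    fi
    return 0
}

# Possible use in a monitor script (PIDFILE as defined above):
#   kill_stale_instance "$PIDFILE" "`basename ${0}`"
#   echo "$$" > "$PIDFILE"
```

The extra ps call costs one more process per run, so whether the guard is worth it depends on how much you value the safety check over the overhead.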

Now the heavy lifting begins. In this example, we're simply sending a URI, and examining the response to see if it contains the RECV string:

 # send request & check for expected response
 curl -fNs http://${IP}:${PORT}${URI} | grep -i "${RECV}" > /dev/null 2>&1 

(Remember this is a simplified example. In a real world example, the logic inserted here would replicate whatever transactions you identified earlier as the minimum required interaction to determine the pool member's health. cURL has a wide range of options you can use to mimic almost any browser operation, including sending and receiving cookies, to replicate multi-step transactions or validate complex responses.)

If the response contained the value of the RECV variable, the "grep" command will return 0, causing the script to send the string "UP" to stdout, and the pool member will be marked up immediately. If the response did NOT contain the value of the RECV variable, the "grep" command will return a non-zero value, the script will output nothing to stdout, and the pool member will be marked down when the timeout expires.

 # mark node UP if expected response was received
 if [ $? -eq 0 ]
 then
 echo "UP"
 fi 
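
The matching step can be tried offline by piping sample text into grep in place of a live curl response. In this sketch, response_matches is a hypothetical helper, and the use of -q (quiet) and -F (literal string match) is a choice made for the example rather than what the sample monitor does.

```shell
#!/bin/sh
# Sketch only: response_matches is a hypothetical helper. It reads a
# response on stdin and succeeds if it contains the expected string.
# -q: print nothing; -i: ignore case; -F: treat the string literally.
response_matches() {
    grep -qiF -- "$1"
}

# The exit status of a pipeline is that of its last command (grep here).
if printf 'body: server is up!!!\n' | response_matches "Server is UP!!!"; then
    echo "sample response matches" >&2
fi
```

Using -F avoids surprises when the expected string contains characters that grep would otherwise interpret as part of a regular expression.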

And finally the script will delete the PID file written earlier (since it has finished cleanly and won't need to be killed off by the next instance) and then exit:

 rm -f $PIDFILE
 exit 
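
Readers report in the comments on this article that the script stops running as soon as it writes to stdout, which would leave the PID file behind on success. If that matches what you see, one option is to clean up first and report last. The sketch below assumes a hypothetical helper named finish_monitor and uses a temporary file in place of the real PID file.

```shell
#!/bin/sh
# Sketch only: finish_monitor is a hypothetical helper.
# $1 = exit status of the health check, $2 = PID file path.
finish_monitor() {
    status="$1"
    pidfile="$2"
    # Clean up first, so cleanup never depends on code after the output.
    rm -f "$pidfile"
    if [ "$status" -eq 0 ]; then
        echo "UP"
    fi
}

PIDFILE=$(mktemp)   # stand-in for the real PID file
true                # stand-in for the real health check
finish_monitor $? "$PIDFILE"
```

Writing "UP" as the very last action means nothing important is lost if the script is stopped at that point.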


It doesn't work... what now?

Troubleshooting external monitors can be challenging. In my next article, I'll cover the basic process you can follow to track down and resolve any issues that may interfere with proper monitor operation. (LTM External Monitors: Troubleshooting)

Published Jan 25, 2008
Version 1.0
  • uni
    Stripping the ::ffff from the IPv6 address is only effective for route domain %1. If your nodes are in any other route domain, you are still stuck.

     

    Ideally, you need to use the IPv6 address as supplied in the command line parameter, so your script will work whether you use route domains or not. The way to do this is to turn off globbing and put square brackets around the address:

     

     

    curl -fNsg http://[${IP}]:${PORT}${URI}

     

  • FYI: When the external monitor script outputs "UP", the deletion of the PIDFILE never occurs, and neither do any further commands after the output of "UP".

     

     

    It does however delete the PID if the monitor fails. I can only assume the F5 kills the script immediately once any output is detected...
  • Just to add one more thing to the below advice - as long as your monitor node supports IPv6, you can modify the script to REMOVE the route domain from the IP string that the F5 passes to your script, by replacing IP=`echo ${1} | sed 's/::ffff://'` with IP=`echo ${1} | sed 's/%100//'` (or whatever your route domain is...). Then follow the suggestion below: add the square brackets around the IP variable, and add the g to the curl command.
  • And the new finding is that the script execution STOPS when anything is written to the standard output!!! How can this be?!! well it is that way. so clean up BEFORE anything is written!

     

    I'm going to expand my comment. Is it documented that this is the behavior of the external monitor scripts? If so can someone point it out to the rest of us?!! Please!!!!

     

    I have seen a number of scripts on devcentral and almost ALL of them try to have cleanup code AFTER the success is indicated by writing to standard output (via Echo or otherwise).

     

    Almost ALL of them have this process/PID management business in them and have a cleanup at the end.. Which, as many have observed (and probably spent hours of time trying to figure out what was wrong), doesn't clean up, so the next run tries to kill a process that doesn't exist...

     

    A hole with this is that it might kill some other process as it doesn't appear from examples I have seen that it really checks that the process it is going to kill is actually one that is from this same monitor.

     

    Then I wonder why this is even necessary.. How often, if ever, would a process still be lingering? These should be quick and lightweight monitors.. A couple examples are very extensive.. and some are structured very nicely, but end up introducing overhead with all the elegance (not that I am opposed to elegance and well written code).

     

    So then I finally have to comment on how the external monitor reports back the success or failure. This is just full of problems IMHO that can lead to it indicating success when in reality some error or other issue occurred that writes something to the standard output resulting in two things: 1. The script STOPS at that point. 2. It indicates success - that the health monitor indicates the server is good for use.

     

    Oh my gosh, really? This is a very sloppy hand off.. Can the developers look at this and realize how much they passed off to the writer of the external monitor to know and be aware of? (Again, is this outlined anywhere?)

     

    A hand off should be clean and clear. Yes it succeeded, No it did not succeed. Other systems have return codes.. Zero is success anything else is a failure code or other indicator. Or they have an agreed to return.. which could be a string to match success/failure.

     

    Why does the script stop when something is written (indicating success)?.. Perhaps to reduce the 'overhead' of the monitor. But almost everyone writing these seems to be unaware of this behavior, and why would anyone writing a script need to think.. oh something got written out.. the script is over?

     

    Off the soapbox.. but the design and the behaviors really makes it confusing for those who would like to write an external monitor.