LTM External Monitors: Troubleshooting

My last article explained the basics of implementing an LTM external monitor script. This article covers some helpful tools and techniques you can use to validate/troubleshoot/debug your external monitor implementation.

There are 2 basic parts to an external monitor:
1) The monitor definition in the LTM configuration; and
2) the external program it calls.

The external program should be validated before configuring an external monitor to use it. I'll continue to use the shell script from the previous article as an example, but the basic technique is the same regardless of the nature of the program.

Validating the external program

Validating the external program is fairly straightforward. Simply run the program at the command line with the expected commandline arguments and the expected environment variables defined. The program must either send something to standard output (stdout) if the pool member is to be marked up, or output nothing to stdout if it is to be marked down. (Output to standard error (stderr) is ignored by the monitoring daemon, but should be resolved before deploying the monitor.)
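
The stdout rule above can be approximated at the command line with a small wrapper. This is only a sketch: the function name monitor_verdict is my own invention, and it mimics the rule as described here rather than reproducing what the monitoring daemon does internally.

```shell
# Hypothetical helper (not part of LTM): run a monitor program and report
# the verdict implied by the "anything on stdout means up" rule.
monitor_verdict() {
    # Capture stdout only; stderr is discarded because it does not
    # influence the up/down decision.
    out=$("$@" 2>/dev/null)
    if [ -n "$out" ]; then
        echo "up"
    else
        echo "down"
    fi
}
```

It would be used like this: monitor_verdict /usr/bin/monitors/HTTPMonitor_cURL_BasicGET 10.0.0.1 80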

Start by installing the program in the /usr/bin/monitors directory and making it executable. Instructions are here.

Debugging: Validating program I/O

Identify the environment variables the program requires and export them to the shell at the command line. For our example script, the variables URI and RECV are required, so export them like this:

 export RECV="Server is UP!!!"
 export URI="/testpath/testfile.html" 

Verify they are set as expected by running the "set" command and grepping for the expected variable names:

 set | grep RECV
 set | grep URI 
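
Note that "set" also lists variables that are defined but not exported, and those will not be visible to the program. A stricter pre-flight check might look like the sketch below; the function name require_vars is hypothetical, and it relies only on the standard env and grep utilities.

```shell
# Hypothetical pre-flight check: confirm each named variable is exported
# and non-empty before running the monitor program by hand.
require_vars() {
    missing=0
    for name in "$@"; do
        # env lists only exported variables; "=." requires a non-empty value.
        if ! env | grep "^${name}=." > /dev/null; then
            echo "not exported or empty: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the example script you would run: require_vars RECV URI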

Execute the program, giving the expected commandline arguments, which at a minimum will be the IP address and port of the pool member to be tested. For our example script, the commandline arguments are simply that -- IP and port of one pool member:

 /usr/bin/monitors/HTTPMonitor_cURL_BasicGET 10.0.0.1 80 

The pool member against which the test is run should be in a known state. I recommend starting with a pool member that's known to be up. If the pool member is up and the script operates as expected, the script should return the expected character string to stdout.

This code snippet from our example specifies the character string that will be sent to stdout only if the expected response is seen. The string that will be conditionally output to stdout is "UP":

 if [ $? -eq 0 ]
 then
     echo "UP"
 fi

The expected result for a healthy node would be:

 monitors # /usr/bin/monitors/HTTPMonitor_cURL_BasicGET 10.0.0.1 80
 UP
 monitors #  
 

Once you have verified that a healthy pool member is marked up as expected by the program, test against a known down pool member. For a known down pool member, the script should return nothing, and the monitoring daemon will mark it down at the expiration of the timeout.

The expected result for an unhealthy node would be no output:

 monitors # /usr/bin/monitors/HTTPMonitor_cURL_BasicGET 10.0.0.1 80
 monitors # 

If you've tested against both a healthy pool member and an unhealthy one, and the script doesn't return the expected output, or if any unexpected output is seen on either stdout or stderr, then some debugging is in order.

The most common not-so-obvious mistake we see in external monitors is uncontrolled output to stdout -- commands that send data to stdout every time the script runs. For example, pgroven's initial version included the following command to display a variable name and value on screen for debugging purposes:

 echo "exstatus = $exstatus" 

Given that code, the result for a healthy node would be:

 monitors # /usr/bin/monitors/example_script 10.0.0.1 80
 exstatus = 0
 UP
 monitors # 

which is no problem -- the pool member will be marked up because text was seen on stdout. However, the result for an unhealthy node would be:

 monitors # /usr/bin/monitors/example_script 10.0.0.1 80
 exstatus = 1
 monitors # 

and in this case, the pool member would still be marked up even if the server didn't respond at all. Because there is always something sent to stdout, the script would always cause the pool member to be marked up regardless of the server response (or lack thereof).
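
One way to keep debugging output without affecting the result is to send it to stderr instead. A minimal sketch, assuming a helper named debug (my own name, not something from the original script):

```shell
# Hypothetical debug helper: diagnostics go to stderr, never stdout.
debug() {
    echo "$*" >&2
}

exstatus=1
debug "exstatus = $exstatus"   # shown on the terminal, not counted as "up"
if [ "$exstatus" -eq 0 ]; then
    echo "UP"                  # the only line that should ever reach stdout
fi
```

With this change the unhealthy case still shows the exstatus line on screen, but nothing reaches stdout, so the pool member is not marked up.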

Debugging: Validating network I/O

Looking at the server logs can be helpful, but the most definitive information comes from the network itself. A packet trace lets you look directly at the conversation that is happening on the wire (or, as the case may be, not happening). You'll want to capture on the server-facing interface, and filter for the non-floating self IP on the server-facing VLAN and the port of the pool member:

 tcpdump -nni <server_vlan> -Xs 0 host <internal_non-floating_selfIP> and port <pool_member_port> 

If the expected request traffic is not seen being sent from LTM to the server, it is typically for one of 2 reasons:
1) the command issuing the request is flawed (we'll address that in the next section); or
2) the network address doesn't appear to be valid.

If the network address doesn't appear to be valid, all the usual suspects come into play: missing/bad IP address, no route to host, missing/bad L2 address (pool member is hard down or not visible for some other reason), missing/bad port, service not listening, etc. Verify connectivity for L1-4 as you usually would.

Poster pgroven's case highlighted a common issue that manifests in the request never being sent: The monitoring daemon passes the IP address of the pool member to the external program in IPv4-mapped IPv6 format, as explained in the previous article. The IPv6 address passed to the cURL command is not a valid address for the corresponding IPv4 pool member -- it doesn't map to any L2 address. Since there is no valid IP address to which to send the cURL request, no packets will be seen on the wire. Adding the code to translate the IPv6 address format to IPv4 resolves that issue.

This situation raises an important troubleshooting detail: The most authoritative test for all of the script execution test examples shown here would actually be to use the same format the monitoring daemon would when passing the commandline argument for the IP address:

 /usr/bin/monitors/HTTPMonitor_cURL_BasicGET ::ffff:10.0.0.1 80 

Otherwise you're not actually testing the entire script.
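
The translation step can also be checked on its own. The sketch below wraps the echo/sed approach in a function; the name to_ipv4 is hypothetical, and I have anchored the pattern to the start of the string, which is a small variation on the example script.

```shell
# Strip the IPv4-mapped IPv6 prefix if present; plain IPv4 passes through.
to_ipv4() {
    echo "$1" | sed 's/^::ffff://'
}

to_ipv4 ::ffff:10.0.0.1
to_ipv4 10.0.0.1
```

Both calls should print the plain IPv4 address, which confirms the script behaves the same whether you test with the mapped format or the plain one.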

Debugging: Validating individual commands

If you've examined a packet trace and you see the request being sent to the server, but no response or an error response is seen, take a closer look at the request itself and make sure that it is valid and well formed.

Using our shell script as an example, you can debug the request itself by issuing the same command at the command line with the appropriate values inserted, adjusting as necessary until the expected response is received. You may need to remove output suppression flags if they were included. For example, the cURL command in the example shell script is embedded in this line of code:

 curl -fNs http://${IP}:${PORT}${URI} | grep -i "${RECV}" > /dev/null 2>&1

so you would want to remove the output suppression flags (-f and -s) and substitute the appropriate variable values to submit this request from the command line:

 curl -N http://10.0.0.1:80/testpath/testfile.html 

Once any existing problems have been identified and resolved, substitute the corrected command(s) into your script and continue testing.
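
The matching half of the pipeline can be tested separately from the request by feeding it a saved or hand-typed response. A rough sketch, assuming a helper named recv_matches (hypothetical) that uses the same case-insensitive grep as the example script:

```shell
# Hypothetical helper: succeed if the text on stdin contains the receive
# string, matched case-insensitively as in the example script.
recv_matches() {
    grep -i -- "$1" > /dev/null
}

if printf 'Server is UP!!!\n' | recv_matches "Server is UP!!!"; then
    echo "match"
else
    echo "no match"
fi
```

Keep in mind that grep treats the receive string as a regular expression, so characters such as "." or "*" in RECV may match more than you intend.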

Debugging: Validating logic

If your request and response look correct, but the script is still not producing the expected output, you'll have to dig into the script logic.

For a shell script such as the one we're using in our example, the simplest way is to invoke the shell command explicitly with the xtrace flag, passing it the script name and its commandline arguments. The xtrace flag (-x) causes the shell to write each command (preceded by '+') to stderr before it is executed, displaying all variables fully expanded and showing the results of any logical comparisons.

Running this command to test the script

 sh -x /usr/bin/monitors/HTTPMonitor_cURL_BasicGET ::ffff:10.0.0.1 80 

gives this output (on stderr) for a pool member NOT returning the expected response:

 ++ echo ::ffff:10.0.0.1
 ++ sed s/::ffff://
 + IP=10.0.0.1
 + PORT=80
 ++ basename /usr/bin/monitors/HTTPMonitor_cURL_BasicGET
 + PIDFILE=/var/run/HTTPMonitor_cURL_BasicGET.10.0.0.1_80.pid
 + '[' -f /var/run/HTTPMonitor_cURL_BasicGET.10.0.0.1_80.pid ']'
 + echo 19955
 + curl -fNs http://10.0.0.1:80/testpath/testfile.html
 + grep -i 'Server is UP!!!'
 + '[' 1 -eq 0 ']'
 + rm -f /var/run/HTTPMonitor_cURL_BasicGET.10.0.0.1_80.pid
 + exit 

A healthy pool member returning the expected response will produce output similar to the following:

 ++ echo ::ffff:10.0.0.1
 ++ sed s/::ffff://
 + IP=10.0.0.1
 + PORT=80
 ++ basename /usr/bin/monitors/HTTPMonitor_cURL_BasicGET
 + PIDFILE=/var/run/HTTPMonitor_cURL_BasicGET.10.0.0.1_80.pid
 + '[' -f /var/run/HTTPMonitor_cURL_BasicGET.10.0.0.1_80.pid ']'
 + echo 20064
 + curl -fNs http://10.0.0.1:80/testpath/testfile.html
 + grep -i 'Server is UP!!!'
 + '[' 0 -eq 0 ']'
 + echo UP
 UP
 + rm -f /var/run/HTTPMonitor_cURL_BasicGET.10.0.0.1_80.pid
 + exit 

All lines prefixed with "+" are output on stderr from xtrace. Notice that in the first example, all output was from xtrace, and nothing went to stdout, so the monitoring daemon would have marked the pool member down after the timeout expired unless another instance marked it up again. In the second, the one line without the + prefix is the expected output to stdout, the string "UP", and the monitoring daemon would have marked the pool member up.
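
Because the trace and the real output are interleaved on screen, it can help to split them. The wrapper below is a sketch under my own naming (run_traced and the trace file path are hypothetical): it saves the xtrace output to a file and leaves only the script's stdout visible.

```shell
# Hypothetical wrapper: run a script under "sh -x", save the xtrace
# (stderr) to a file, and leave only the script's real stdout on screen.
run_traced() {
    trace_file=$1
    shift
    sh -x "$@" 2> "$trace_file"
}
```

For example: run_traced /var/tmp/monitor.trace /usr/bin/monitors/HTTPMonitor_cURL_BasicGET ::ffff:10.0.0.1 80. Anything that still appears on screen is what the monitoring daemon would see on stdout; the trace file holds the step-by-step detail.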

Those are the primary tools and techniques I use for troubleshooting external monitor scripts. You can follow this sequence, or you could do it all backwards, or start in the middle. It is mildly complex, so where you dig in really depends on what you observe and/or your gut instinct about what might be wrong.

Validating the external monitor template configuration

Validation of the external monitor template configuration is simply a matter of comparing its settings to the commandline arguments and variables you used successfully during validation of the external program itself, then applying the monitor to a pool member to watch it in action.

You can list the monitor template from the LTM configuration with the bigpipe command:

 bigpipe monitor <monitor_name> list
 
 bigpipe monitor ExternalHTTP list

For our example, the monitor definition looks like this, which reflects the command line arguments and variable values with which we tested above:

 monitor ExternalHTTP {
    defaults from external
    RECV "Server is UP!!!"
    run "HTTPMonitor_cURL_BasicGET"
    URI "/testpath/testfile.html"
 }

Once you've configured the external monitor template with the name of the script, the commandline arguments, and the variables it requires to function, apply it to a single pool member and monitor the results. If the results are as expected, then congratulations! You are the proud parent of a functional external monitor. If the pool member is not marked up or down as expected, then start again by examining a packet trace, and make sure command line testing gives you the exact results you expect.


In the next article, I'll show you a neat trick used by another codeshare sample to more aggressively mark pool members down before the timeout expires.

Published Jan 29, 2008
Version 1.0
  • How do you run this if your monitor is in a partition? sh -x /usr/bin/monitors/HTTPMonitor_cURL_BasicGET ::ffff:10.0.0.1 80
  • I cannot get this testing to work.

    My script is located here: /config/filestore/files_d/Common_d/external_monitor_d/ and expects Host-header, header-range and goodword to be set, but when I do the

    export HOST=
    export HEADERRANGE=0-65535
    export GOODWORD=flannen

    And no matter what I do I get this:

    rdexec 16 sh -x /config/filestore/files_d/Common_d/external_monitor_d/:Common:curl-external_monitor_120542_4 10.75.2.30 80

    This is the output command I get:

    curl -fNs http://10.75.2.30:80 -H 'Host:{HOST}' -r '{HEADERRANGE}'

    grep -i ' '

    Not really useful. It seems like it is not taking notice of the variables I set.