Part 3: Monitoring the health of BIG-IP APM network access PPP connections with a periodic iCall handler
In this part, you monitor the health of PPP connections on the BIG-IP APM system by monitoring the frequency of a particular log message in the /var/log/apm file. In this log file, when the system is under high CPU load, you may observe different messages indicating users' VPN connections are disconnecting. For example:
- You observe multiple of the following messages where the PPP tunnel is started and closed immediately:
Mar 13 13:57:57 hostname.example notice tmm3[16095]: 01490505:5: /Common/accessPolicy_policy:Common:ee6e0ce7: PPP tunnel 0x5604775f7000 (ID: 2c4507ea) started. Mar 13 13:57:57 hostname.example notice tmm3[16095]: 01490505:5: /Common/accessPolicy_policy:Common:ee6e0ce7: PPP tunnel 0x560478482000 (ID: acf63e47) closed.
- You observe an unusually high rate of log messages indicating users' APM sessions terminated due to various reasons in a short span of time.
Mar 13 14:01:10 hostname.com notice tmm3[16095]: 01490567:5: /Common/AccessPolicy_policy:Common:a7cf25d9: Session deleted due to user inactivity.
- You observe an unusually high rate of log messages indicating different user apm session IDs are attempting to reconnect to network access, VPN.
Mar 13 12:18:13 hostname notice tmm1[16095]: 01490549:5: /Common/AP_policy:Common:2c321803: Assigned PPP Dynamic IPv4: 10.10.1.20 ID: 033d0635 Tunnel Type: VPN_TUNNELTYPE_TLS NA Resource: /Common/AP_policy_na_res Client IP: 172.2.2.2 – Reconnect
In this article, you monitor the Reconnect message in the last example because it indicates that the VPN connection terminated and the client system still wants to reconnect. When this happens for several users and repeatedly, it usually indicates that the issue is not related to the client side. The next question is, what is an unusually high rate of reconnect messages, taking into account the fact that some reconnects may be due to reasons on the client side and not due to high load on the system.
You can create a periodic iCall handler to run a script once every minute. Each time the script runs, it uses grep to find the total number of Reconnect log messages that happened over the last 3 minutes for example. When the average number of entries per minute exceeds a configured threshold, the system can take appropriate action. The following describes the procedures:
- Creating an iCall script to monitor the rate of reconnect messages
- Creating a periodic iCall handler to run the iCall script once a minute
- Testing the implementation using logger
1. Creating an iCall script to monitor the rate of reconnect messages
The example script uses the following values:
- Periodic handler interval 1 minute: This runs the script once a minute to grep the number of reconnect messages logged in the last 3.0 minutes.
- Period 3.0 minutes: The script greps for the Reconnect messages logged in the last three minutes.
- Critical threshold value 4: When the average number of Reconnect messages per minute exceeds four or a total of 12 in the last three minutes, the system logs a critical message in the /var/log/apm file.
- Alert threshold value 7: When the average number of Reconnect messages per minute exceeds seven or a total of 21 in the last three minutes, the system logs an alert message and requires immediate action.
- Emergency threshold 10: When the BIG-IP system is logging 10 Reconnect messages or more every minute for the last three minutes or a total of 30 messages, the system logs an emergency message, indicating the system is unusable.
You should reconfigure these values at the top of the script according to the behavior and set up of your environment. By performing the count over an interval of three minutes or longer, it reduces the possibility of high number of reconnects due to one-off spikes.
Procedure
Perform the following procedure to create the script to monitor CPU statistics and log an alert message in the /var/log/ltm file when traffic exceeds a CPU threshold value.
To create an iCall script, perform the following procedure:
- Log in to tmsh.
- Enter the following command to create the script in the vi editor:
create sys icall script vpn_reconnect_script
3. Enter the following script into the definition stanza in the editor:
This script counts the average number of entries/min the Reconnect message is logged in /var/log/apm over a period of 3 minutes.
definition { #log file to grep errormsg set apm_log "/var/log/apm" #This is the string to grep for. set errormsg "Reconnect" #num of entries = grep $errormsg $apm_log | grep $hourmin | wc -l #When (num of entries < crit_threshold), no action #When (crit_threshold <= num of entries < alert_threshold), log crtitical message. #When (alert_threshold <= num of entries < emerg_threshold), log alert message. #When (emerg_threhold <= num of entries log emerg message set crit_threshold 4 set alert_threshold 7 set emerg_threshold 10 #Number of minutes to take average of. E.g. Every 1.0, 2.0, 3.0, 4.0... minutes set period 3.0 #Set this to 1 to log to /var/tmp/scriptd.out. Set to 0 to disable. set DEBUG 0 set total 0 puts "\n[clock format [clock seconds] -format "%b %e %H:%M:%S"] Running script..." for {set i 1} {$i <= $period} {incr i} { set hourmin [clock format [clock scan "-$i minute"] -format "%b %e %H:%M:"] set errorcode [catch {exec grep $errormsg $apm_log | grep $hourmin | wc -l} num_entries] if {$errorcode} { set num_entries 0 } if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $num_entries times."} set total [expr {$total + $num_entries}] } set average [expr $total / $period] set average [format "%.1f" $average] if {$average < $crit_threshold} { if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average. Below all threshold. No action."} exit } if {$average < $alert_threshold} { if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average. Reached critical threshold $crit_threshold. Log Critical msg."} exec logger -p local1.crit "01490266: \"$errormsg\" logged $average times on average in last $period mins. >= critical threshold $crit_threshold." exit } if {$average < $emerg_threshold} { if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average. Reached alert threshold $alert_threshold. Log Alert msg."} exec logger -p local1.alert "01490266: \"$errormsg\" logged $average times on average in last $period mins. >= alert threshold $alert_threshold." exit } if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average in last $period mins. Log Emerg msg"} exec logger -p local1.emerg "01490266: \"$errormsg\" logged $average times on average in last $period mins. >= emerg threshold $emerg_threshold." exit }
4. Configure the variables in the script as needed and exit the editor by entering the following command:
:wq! y
5. Run the following command to list the contents of the script:
list sys icall script vpn_reconnect_script
2. Creating a periodic iCall handler to run the script
In this example, you create the iCall handler to run the script once a minute. You can increase this interval to once every two minutes or longer. However, you should consider this value together with the period value of the script in the previous procedure to ensure that you're notified on any potential issues early.
Procedure
Perform the following procedure to create the periodic handler that runs the script once a minute.
To create an iCall periodic handler, perform the following procedure:
- Log in to tmsh.
- Enter the following command to create a periodic handler:
create sys icall handler periodic vpn_reconnect_handler interval 60 script vpn_reconnect_script
3. Run the following command to list the handler:
list sys icall handler periodic vpn_reconnect_handler
4. You can start and stop the handler by using the following command syntax:
<start|stop> sys icall handler periodic vpn_reconnect_handler
3. Testing the implementation using logger
You can use the logger command to log test messages to the /var/log/apm file to test your implementation. To do so, run the following command the required number of times to exceed the threshold you set:
Note: The following message must contain the keyword that you are searching for in the script. In this case, the keyword is Reconnect.
logger -p local1.notice "01490549:5 Assigned PPP Dynamic IPv4: 10.10.1.20 ID: 033d0635 Tunnel Type: VPN_TUNNELTYPE_TLS NA Resource: /Common/AP_policy_na_res Client IP: 172.2.2.2 - Reconnect"
Follow the /var/tmp/scriptd.out and /var/log/apm file entries to verify your implementation is working correctly.
Conclusion
This article lists three use cases for using iCall to monitor the health of your BIG-IP APM system. The examples mainly log the appropriate messages in the log files. You can extend the examples to monitor more parameters and also perform different kinds of actions such as calling another script (Bash, Perl, and so on) to send an email notification or perform remedial action.
- KinEmployee
Abdessamad,
>> I removed the check "if {$errorcode} {set num_entries 0}" to have the error output logged.
Is there a reason, you prefer to remove the check? When nothing was matched by grep, the exit code is 1. The program enters the if condition and sets num_entries to 0.
If num_entries is not set, you'd encounter the error.
If you prefer not to use grep, you can probably use another command to scan through /var/log/apm for the string: errormsg "Reconnect"
Feel free to edit the code to match your requirements/environment.
- Abdessamad1Cirrostratus
Dear Kin,
I initially had the issue with a different script I created myself, but to be sure I made a copy of your script and tested it and the same issue happened.
Here is the code I'm using:
set apm_log "/var/log/apm" set errormsg "Reconnect" set crit_threshold 4 set alert_threshold 7 set emerg_threshold 10 set period 3.0 set DEBUG 1 set total 0 puts "\n[clock format [clock seconds] -format "%b %e %H:%M:%S"] Running script..." for {set i 1} {$i <= $period} {incr i} { set hourmin [clock format [clock scan "-$i minute"] -format "%b %e %H:%M:"] set errorcode [catch {exec grep $errormsg $apm_log | grep $hourmin | wc -l} num_entries] if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $num_entries times."} set total [expr {$total + $num_entries}] } set average [expr $total / $period] set average [format "%.1f" $average] if {$average < $crit_threshold} { if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average. Below all threshold. No action."} exit } if {$average < $alert_threshold} { if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average. Reached critical threshold $crit_threshold. Log Critical msg."} exec logger -p local1.crit "01490266: \"$errormsg\" logged $average times on average in last $period mins. >= critical threshold $crit_threshold." exit } if {$average < $emerg_threshold} { if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average. Reached alert threshold $alert_threshold. Log Alert msg."} exec logger -p local1.alert "01490266: \"$errormsg\" logged $average times on average in last $period mins. >= alert threshold $alert_threshold." exit } if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average in last $period mins. Log Emerg msg"} exec logger -p local1.emerg "01490266: \"$errormsg\" logged $average times on average in last $period mins. >= emerg threshold $emerg_threshold." exit
I removed the check "if {$errorcode} {set num_entries 0}" to have the error output logged.
And here is what I see in /var/tmp/scriptd.out:
Apr 13 09:44:45 Running script...
DEBUG: Apr 13 09:43: "Reconnect" logged 0
child process exited abnormally times.
Followed by other errors, but that's normal as num_entriesis not set correctly.
I hope this clarifies the issue.
Thanks.
Abdessamad
- KinEmployee
Abdessamad, what do you see in the logs (/var/log/ltm, /var/log/apm, /var/log//var/tmp/scriptd.out).. Can you also paste the script that you're using? If that's not convenient, perhaps the minimum amount of code required to reproduce the issue, in particular the grep command you're using.
Thanks for the question - I'm tagging directly to ensure he sees this.
- Abdessamad1Cirrostratus
Hello,
Any idea why the exec command would not work and we get "child process exited abnormally " instead?
looks like using grep with a variable is producing this error. If I grep on static string it's ok.