on 12-May-2020 12:12
In this part, you monitor the health of PPP connections on the BIG-IP APM system by monitoring the frequency of a particular log message in the /var/log/apm file. In this log file, when the system is under high CPU load, you may observe different messages indicating users' VPN connections are disconnecting. For example:
Mar 13 13:57:57 hostname.example notice tmm3[16095]: 01490505:5: /Common/accessPolicy_policy:Common:ee6e0ce7: PPP tunnel 0x5604775f7000 (ID: 2c4507ea) started. Mar 13 13:57:57 hostname.example notice tmm3[16095]: 01490505:5: /Common/accessPolicy_policy:Common:ee6e0ce7: PPP tunnel 0x560478482000 (ID: acf63e47) closed.
Mar 13 14:01:10 hostname.com notice tmm3[16095]: 01490567:5: /Common/AccessPolicy_policy:Common:a7cf25d9: Session deleted due to user inactivity.
Mar 13 12:18:13 hostname notice tmm1[16095]: 01490549:5: /Common/AP_policy:Common:2c321803: Assigned PPP Dynamic IPv4: 10.10.1.20 ID: 033d0635 Tunnel Type: VPN_TUNNELTYPE_TLS NA Resource: /Common/AP_policy_na_res Client IP: 172.2.2.2 – Reconnect
In this article, you monitor the Reconnect message in the last example because it indicates that the VPN connection terminated and the client system still wants to reconnect. When this happens for several users and repeatedly, it usually indicates that the issue is not related to the client side. The next question is, what is an unusually high rate of reconnect messages, taking into account the fact that some reconnects may be due to reasons on the client side and not due to high load on the system.
You can create a periodic iCall handler to run a script once every minute. Each time the script runs, it uses grep to find the total number of Reconnect log messages that happened over the last 3 minutes for example. When the average number of entries per minute exceeds a configured threshold, the system can take appropriate action. The following describes the procedures:
The example script uses the following values:
You should reconfigure these values at the top of the script according to the behavior and set up of your environment. By performing the count over an interval of three minutes or longer, it reduces the possibility of high number of reconnects due to one-off spikes.
Perform the following procedure to create the script to monitor CPU statistics and log an alert message in the /var/log/ltm file when traffic exceeds a CPU threshold value.
To create an iCall script, perform the following procedure:
create sys icall script vpn_reconnect_script
3. Enter the following script into the definition stanza in the editor:
This script counts the average number of entries/min the Reconnect message is logged in /var/log/apm over a period of 3 minutes.
definition { #log file to grep errormsg set apm_log "/var/log/apm" #This is the string to grep for. set errormsg "Reconnect" #num of entries = grep $errormsg $apm_log | grep $hourmin | wc -l #When (num of entries < crit_threshold), no action #When (crit_threshold <= num of entries < alert_threshold), log crtitical message. #When (alert_threshold <= num of entries < emerg_threshold), log alert message. #When (emerg_threhold <= num of entries log emerg message set crit_threshold 4 set alert_threshold 7 set emerg_threshold 10 #Number of minutes to take average of. E.g. Every 1.0, 2.0, 3.0, 4.0... minutes set period 3.0 #Set this to 1 to log to /var/tmp/scriptd.out. Set to 0 to disable. set DEBUG 0 set total 0 puts "\n[clock format [clock seconds] -format "%b %e %H:%M:%S"] Running script..." for {set i 1} {$i <= $period} {incr i} { set hourmin [clock format [clock scan "-$i minute"] -format "%b %e %H:%M:"] set errorcode [catch {exec grep $errormsg $apm_log | grep $hourmin | wc -l} num_entries] if {$errorcode} { set num_entries 0 } if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $num_entries times."} set total [expr {$total + $num_entries}] } set average [expr $total / $period] set average [format "%.1f" $average] if {$average < $crit_threshold} { if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average. Below all threshold. No action."} exit } if {$average < $alert_threshold} { if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average. Reached critical threshold $crit_threshold. Log Critical msg."} exec logger -p local1.crit "01490266: \"$errormsg\" logged $average times on average in last $period mins. >= critical threshold $crit_threshold." exit } if {$average < $emerg_threshold} { if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average. Reached alert threshold $alert_threshold. Log Alert msg."} exec logger -p local1.alert "01490266: \"$errormsg\" logged $average times on average in last $period mins. >= alert threshold $alert_threshold." exit } if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average in last $period mins. Log Emerg msg"} exec logger -p local1.emerg "01490266: \"$errormsg\" logged $average times on average in last $period mins. >= emerg threshold $emerg_threshold." exit }
4. Configure the variables in the script as needed and exit the editor by entering the following command:
:wq! y
5. Run the following command to list the contents of the script:
list sys icall script vpn_reconnect_script
In this example, you create the iCall handler to run the script once a minute. You can increase this interval to once every two minutes or longer. However, you should consider this value together with the period value of the script in the previous procedure to ensure that you're notified on any potential issues early.
Perform the following procedure to create the periodic handler that runs the script once a minute.
To create an iCall periodic handler, perform the following procedure:
create sys icall handler periodic vpn_reconnect_handler interval 60 script vpn_reconnect_script
3. Run the following command to list the handler:
list sys icall handler periodic vpn_reconnect_handler
4. You can start and stop the handler by using the following command syntax:
<start|stop> sys icall handler periodic vpn_reconnect_handler
You can use the logger command to log test messages to the /var/log/apm file to test your implementation. To do so, run the following command the required number of times to exceed the threshold you set:
Note: The following message must contain the keyword that you are searching for in the script. In this case, the keyword is Reconnect.
logger -p local1.notice "01490549:5 Assigned PPP Dynamic IPv4: 10.10.1.20 ID: 033d0635 Tunnel Type: VPN_TUNNELTYPE_TLS NA Resource: /Common/AP_policy_na_res Client IP: 172.2.2.2 - Reconnect"
Follow the /var/tmp/scriptd.out and /var/log/apm file entries to verify your implementation is working correctly.
This article lists three use cases for using iCall to monitor the health of your BIG-IP APM system. The examples mainly log the appropriate messages in the log files. You can extend the examples to monitor more parameters and also perform different kinds of actions such as calling another script (Bash, Perl, and so on) to send an email notification or perform remedial action.
Hello,
Any idea why the exec command would not work and we get "child process exited abnormally " instead?
looks like using grep with a variable is producing this error. If I grep on static string it's ok.
Thanks for the question - I'm tagging directly to ensure he sees this.
Abdessamad, what do you see in the logs (/var/log/ltm, /var/log/apm, /var/log//var/tmp/scriptd.out).. Can you also paste the script that you're using? If that's not convenient, perhaps the minimum amount of code required to reproduce the issue, in particular the grep command you're using.
Dear Kin,
I initially had the issue with a different script I created myself, but to be sure I made a copy of your script and tested it and the same issue happened.
Here is the code I'm using:
set apm_log "/var/log/apm"
set errormsg "Reconnect"
set crit_threshold 4
set alert_threshold 7
set emerg_threshold 10
set period 3.0
set DEBUG 1
set total 0
puts "\n[clock format [clock seconds] -format "%b %e %H:%M:%S"] Running script..."
for {set i 1} {$i <= $period} {incr i} {
set hourmin [clock format [clock scan "-$i minute"] -format "%b %e %H:%M:"]
set errorcode [catch {exec grep $errormsg $apm_log | grep $hourmin | wc -l} num_entries]
if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $num_entries times."}
set total [expr {$total + $num_entries}]
}
set average [expr $total / $period]
set average [format "%.1f" $average]
if {$average < $crit_threshold} {
if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average. Below all threshold. No action."}
exit
}
if {$average < $alert_threshold} {
if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average. Reached critical threshold $crit_threshold. Log Critical msg."}
exec logger -p local1.crit "01490266: \"$errormsg\" logged $average times on average in last $period mins. >= critical threshold $crit_threshold."
exit
}
if {$average < $emerg_threshold} {
if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average. Reached alert threshold $alert_threshold. Log Alert msg."}
exec logger -p local1.alert "01490266: \"$errormsg\" logged $average times on average in last $period mins. >= alert threshold $alert_threshold."
exit
}
if {$DEBUG} {puts "DEBUG: $hourmin \"$errormsg\" logged $average times on average in last $period mins. Log Emerg msg"}
exec logger -p local1.emerg "01490266: \"$errormsg\" logged $average times on average in last $period mins. >= emerg threshold $emerg_threshold."
exit
I removed the check "if {$errorcode} {set num_entries 0}" to have the error output logged.
And here is what I see in /var/tmp/scriptd.out:
Apr 13 09:44:45 Running script...
DEBUG: Apr 13 09:43: "Reconnect" logged 0
child process exited abnormally times.
Followed by other errors, but that's normal as num_entriesis not set correctly.
I hope this clarifies the issue.
Thanks.
Abdessamad
Abdessamad,
>> I removed the check "if {$errorcode} {set num_entries 0}" to have the error output logged.
Is there a reason, you prefer to remove the check? When nothing was matched by grep, the exit code is 1. The program enters the if condition and sets num_entries to 0.
If num_entries is not set, you'd encounter the error.
If you prefer not to use grep, you can probably use another command to scan through /var/log/apm for the string: errormsg "Reconnect"
Feel free to edit the code to match your requirements/environment.