LTM: Interface Failsafe

Nathan McMahon and Kirk Bauer bring us another slick solution for a couple of common requirements in high-performance environments: failing over immediately when a single physical network connection loses connectivity, or when the number of active members in a trunk falls below a configured minimum.

For example, many customers want to fail over immediately when the egress router-facing interface loses connectivity. Others have minimum throughput requirements on multi-port trunks and need to monitor the trunk and fail over when more than one port in the trunk fails.

Existing failsafe mechanisms don't quite address these requirements. Interface Failsafe is the answer.

Failsafe Definitions

As of BIG-IP LTM versions 9.0 – 9.4.x, there are two primary mechanisms built into LTM to determine whether it has lost connectivity with the network. Once that determination is made, the LTM has the option of initiating a reboot or simply failing over to its peer in the redundant pair. The two built-in mechanisms are Gateway Failsafe and VLAN Failsafe.

Both are excellent solutions; however, they don't necessarily cover all scenarios. For instance, Gateway Failsafe requires that the LTM actively ping or ‘touch’ an IP on the network. This can be difficult, as there often isn’t a stable IP address on an internal server network behind the LTM. The typical approach is to have the LTM query the default gateway, but on the server network the LTM is the default gateway of the servers. It can’t reliably query the IP address of its peer, because if the standby LTM loses its network connection, the active LTM would believe it had lost connectivity as well and would initiate a failover.

VLAN Failsafe is a very robust solution, relying on a combination of passively listening for traffic and generating or soliciting traffic if it hasn’t received any data for a specified period of time. While robust, this tends to translate into a relatively long interval between the time the LTM becomes disconnected from the network and the time it signals a failover. Due to potential spanning tree convergence times, this timeout can be as high as 90 seconds, or as low as 10 seconds if PortFast is supported by the switches.

What Is Interface Failsafe?

Interface Failsafe is a custom health monitor, coupled with a supporting LTM configuration, that allows the system to fail over based on physical link status. Interface Failsafe is meant to solve a very specific problem: when an interface becomes unplugged or otherwise loses connectivity to the network, the LTM needs to signal a failover immediately rather than wait for traffic to occur on a disconnected interface. It can also support M-of-N interfaces within a trunked VLAN.

Interface Failsafe provides:

  • A lightweight mechanism to respond to a failure at Layer 1
  • Very quick response to a physical change. In a system that is not overwhelmed by health checking servers or other management tasks, it is possible to provide significantly faster failover times than previously available with VLAN Failsafe
  • A mechanism to determine link status without requiring an active interaction with other devices at Layer 2 or Layer 3 (such as ARP or Ping of an IP address).
  • Ability to support M-of-N redundancy:  If 2 out of 4 ports in a VLAN trunk are down, then signal a failover. See M-of-N  Redundancy section for more details.
  • A complementary addition to the capabilities of VLAN Failsafe. Interface Failsafe can run concurrently with VLAN Failsafe to provide the Layer 1 and Layer 2/3 checking.

Interface Failsafe does not detect a port that has been disabled or blocked in software on the switch. It only checks physical link layer discontinuity such as the cable being unplugged.
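Conceptually, the monitor's link check reduces to reading the status column for each named interface out of the output of b interface show. The sketch below illustrates that core check in shell; the status table is a hypothetical stand-in for real b interface show output, and link_is_up is an illustrative helper, not part of the actual monitor script.

```shell
#!/bin/bash
# Hypothetical status table standing in for "b interface show" output,
# which lists each interface name followed by its link status.
status_table="1.1 UP
1.2 DN
2.1 UP"

# Succeed (exit 0) only if the named interface reports UP in the table.
link_is_up() {
    local status
    status=$(printf '%s\n' "$status_table" | grep "^ *$1 " | awk '{print $2}')
    [ "$status" = "UP" ]
}
```

An external monitor built on this idea simply exits non-zero (no output) when the check fails, which marks the pool member down and trips the gateway failsafe threshold.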

M-of-N Redundancy

A common request is to health check a trunked VLAN. If the trunked VLAN has 4 interfaces then it might not be appropriate to signal a failover if only one of the interfaces is down. However, if two of the four are down, then you may want to initiate the failover. The Gateway and VLAN Failsafe options do not support this capability, as they would require all four interfaces to fail before signaling a failover. It is possible to monitor the trunk state and fail over via iControl, but a solution which doesn't depend on an external system to monitor the trunk state is preferable. With Interface Failsafe you can incorporate this ability.
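The M-of-N check itself is just counting: tally how many of a trunk's member interfaces report UP, and fail the monitor when the tally drops below the configured minimum. A minimal shell sketch of that logic, using a hypothetical status table in place of real b interface show output (trunk_meets_minimum is an illustrative helper, not taken from the monitor script):

```shell
#!/bin/bash
# Hypothetical status table standing in for "b interface show" output.
status_table="1.1 UP
1.2 UP
1.3 DN
1.4 DN"

# Succeed only if at least $1 of the remaining arguments (the trunk's
# member interfaces) report UP in the table.
trunk_meets_minimum() {
    local min=$1 up=0 member status
    shift
    for member in "$@"; do
        status=$(printf '%s\n' "$status_table" | grep "^ *$member " | awk '{print $2}')
        if [ "$status" = "UP" ]; then
            up=$((up + 1))
        fi
    done
    [ "$up" -ge "$min" ]
}
```

With the sample table above, a four-member trunk with a minimum of 2 passes (two members are up), while a minimum of 3 fails and would signal a failover.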

Enabling Interface Failsafe

1.  Copy the appropriate script from the codeshare to /usr/bin/monitors/interface_failsafe.eav

2.  Type: chmod 700 /usr/bin/monitors/interface_failsafe.eav

3.  From the web UI create a new monitor called interface_failsafe

  • Type: External
  • Interval/Timeout: Recommended starting values are a 5-second interval and an 11-second timeout.
    This monitor is fairly lightweight, so it can run at relatively aggressive intervals compared to typical server health checks. Keep in mind that a heavily loaded LTM may still require backing off the interval. The timeout must be greater than the interval.
  • External Program: /usr/bin/monitors/interface_failsafe.eav
  • Arguments:
    • For interface failsafe only (no trunk support), enter a list of the interfaces for which Interface Failsafe should be enabled. Use spaces to separate multiple entries: 1.1 1.7 2.1

    • For interface failsafe with Trunk Minimum Active Members support, you can also include in the list of interfaces values indicating the name of a trunk and the minimum number of interfaces that should be up to avoid failover. For example, the following arguments will make sure that the 1.1 interface is up and that test_trunk has at least two interfaces up:  1.1 test_trunk=2
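Each argument is therefore either a bare interface name or a name=minimum pair. A minimal shell sketch of how such an argument could be split apart (parse_arg is an illustrative helper and not taken from the actual monitor script):

```shell
#!/bin/bash
# Split one monitor argument into a name and an optional minimum-members
# count: "1.1" yields name=1.1 with min empty; "test_trunk=2" yields
# name=test_trunk and min=2. Results are left in the globals $name/$min.
parse_arg() {
    case "$1" in
        *=*) name=${1%%=*}; min=${1#*=} ;;  # trunk with a minimum member count
        *)   name=$1;       min=""     ;;   # plain interface
    esac
}
```

A monitor script could then treat entries with an empty minimum as single interfaces and the rest as trunks to check member counts for.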
4.  Create a new pool

  • Name: Interface_Failsafe_Pool_1
  • Health Monitor: Select the interface_failsafe monitor created in step 3
  • Address / Port: Pick any IP / Port. This data is discarded later and never used. This does not need to be a real server or device.

5.  Create a second new pool

  • Name: Interface_Failsafe_Pool_2
  • Health Monitor: Select the interface_failsafe monitor created in step 3
  • Address / Port: Pick any IP / Port.

6.  Create the gateway failsafe: System > High Availability > Fail-safe > Gateway

  • Gateway Pool: Interface_Failsafe_Pool_1
  • Unit ID: 1
  • Threshold: 1 (number of members which must be available in the pool to avoid failover)
  • Action: Best practice is to Fail Over, not reboot.

7.  Create a second gateway failsafe

  • Gateway Pool: Interface_Failsafe_Pool_2
  • Unit ID: 2
  • Threshold: 1
  • Action: Fail Over

8.  Synchronize the configuration to the LTM’s peer and test.

Best Practices

  • As with any failover mechanism, it is prudent to regularly test and ensure that it is configured correctly and will behave as you expect. While the Interface Failsafe can provide a greater level of availability insurance, it is important to understand exactly what it is and isn’t providing.
  • Setting both units to reboot is never recommended. If both LTMs are plugged into the same switch and the switch dies, both LTMs will continuously reboot until an administrator manually intervenes. Consider setting only one LTM to reboot, or preferably set both to the Fail Over option.
  • Interface Failsafe can also be used in conjunction with VLAN failsafe to support both Layer 3 (network traffic connectivity) and Layer 1 connectivity. This provides the robustness of the VLAN Failsafe with the fast detection of link down status by the Interface Failsafe.


Links
Monitor Script Source

Published Mar 05, 2008
Version 1.0
  • Hello, all.

    I set up an HA lab with two virtual BIG-IPs, but it fails.

    Both BIG-IPs run "BIG-IP 10.1.0 Build 3341.1084 Final".

    BIGIP_1 ------- (HA mode: Network Failover) ------- BIGIP_2

    HA mode: Active-Active

    High Availability: Redundancy; unit 1 is the BIGIP_1 device, unit 2 is the BIGIP_2 device.

    BIGIP_1: ConfigSync Peer Address set to 198.18.19.252; BIGIP_1's management IP address is 192.168.0.105
    BIGIP_2: ConfigSync Peer Address set to 198.18.19.251; BIGIP_2's management IP address is 192.168.0.106

    Detect ConfigSync Status: enabled

    BIGIP_1: failover1|198.18.19.251|198.18.19.252|1026
    failover2|192.168.0.105|192.168.0.106|1026

    BIGIP_2: failover1|198.18.19.252|198.18.19.251|1026
    failover2|192.168.0.106|192.168.0.105|1026

    Network mirroring:

    BIGIP_1: self 198.18.19.251, remote 198.18.19.252
    BIGIP_2: self 198.18.19.252, remote 198.18.19.251

    High Availability > Fail-safe > Gateway:

    Added pool failover_pool; the only pool member is 198.18.19.1 (the default gateway); health monitor set to "gateway_icmp".

    OK, they can synchronize the configuration, but they do not switch over. When BIGIP_2 goes down, BIGIP_1 can't take over BIGIP_2's traffic.

    The b unit command on BIGIP_1 displays unit 1:

    [root@BIGIP_1:Active] config b unit
    UNITS 1

    After rebooting BIGIP_2 and running b failover standby on BIGIP_2, b unit on BIGIP_1 displays:

    [root@BIGIP_1:Active] config b unit
    UNITS 1 and 2

    But after shutting down BIGIP_2:

    [root@BIGIP:Active_] config b unit
    UNITS 1 and 2

    Why?

    more /var/log/ltm

    Apr 2 23:04:07 local/BIGIP notice sod[2598]: 010c0044:5: Command: go failback BIGpipe.
    Apr 2 23:04:25 local/BIGIP notice sod[2598]: 010c0020:5: Unit 1 and 2
    Apr 2 23:05:00 local/BIGIP notice sod[2598]: 010c0020:5: Unit 1
    Apr 2 23:05:03 local/BIGIP notice mcpd[2596]: 01070640:5: Node 198.18.19.252 monitor status down.
    Apr 2 23:05:03 local/tmm err tmm[3626]: 01010028:3: No members available for pool failover_pool
    Apr 2 23:05:05 local/BIGIP notice mcpd[2596]: 01070638:5: Pool member 198.18.19.252:0 monitor status down.

    How can I fix this problem? HA still does not work. Can you suggest how to get the normal switchover process to complete, or do you have other ideas?

    Thanks!
  • Don_MacVittie_1 (Historic F5 Account):

    Hey gavincheng,

    I believe that High Availability is not supported in this version of BIG-IP LTM-VE. Notice in the release notes that "Redundant system configuration" is on the list of unsupported features.

    Probably not the answer you were looking for, but I hope it resolves your question at least!

    Regards,
    Don.
  • Vincent_Li_9688 (Historic F5 Account):

    I have modified the script to avoid potential false failover.

    #!/bin/bash
    # Copyright 2007 F5 Networks, Inc.
    # Kirk Bauer kirk@f5.com
    # modifed by Vincent Li v.li@f5.com to avoid possible I/O buffer
    #
    # Pass in each interface to monitor via the Arguments field in the GUI

    # Collect arguments (first remove IP and port as we don't use those)
    shift
    shift
    interfaces="$*"

    b interface show > /tmp/b_interface_show

    for i in $interfaces ; do
        #status=`b interface $i show | tail -1 | awk '{print $2}'`
        status=`grep "^ *$i " /tmp/b_interface_show | awk '{print $2}'`
        if [ "$status" != "UP" ] ; then
            logger -p local0.notice "$MON_TMPL_NAME: interface $i is not up (status: $status)"
            exit 1
        fi
    done

    # All specified interfaces are up...
    #logger -p local0.notice "$MON_TMPL_NAME: interface $i is up (status: $status)"
    echo "up"
    exit 0
  • Has anyone tried this on version BIG-IP 10.2.1 Build 297.0 Final? Both nodes in an active/standby configuration go into standby mode. I tried a bad IP for both pools, and a bad IP for pool-1 with a good IP for pool-2. Neither seems to work.
  • Confirmed, this solution (both interface and VLAN) works on BIG-IP 10.2.1 Build 297.0 with HF2
  • Has anyone tried "Trunk" Failsafe?

    I tested it on version 10.2.1 511.0, but it does not seem to work. The pool member is always down.
  • Dears, can someone help me understand what exactly the script below does?

    shift
    shift
    interfaces="$*"
    interfaces="1/2.2 2/2.2"
    b interface show > /tmp/b_interface_show
    for i in $interfaces ; do
        status=`grep "^ *$i " /tmp/b_interface_show | awk '{print $2}'`
        logger -p local0.notice "interface $i is parsed and status is $status"
        if [ "$status" = "DN" ] ; then
            logger -p local0.notice "$MON_TMPL_NAME: interface $i is DOWN (status: $status)"
            for f in $interfaces ; do
                logger -p local0.notice "$MON_TMPL_NAME: bring down other interfaces"
                if [ $f != $i ] ; then
                    b interface $f disable
                fi
            done
            echo "failed" > /tmp/int_fail_state
            exit 1
        fi
        if [ "$status" = "UP" ] ; then
            state=`cat /tmp/int_fail_state`
            if [ "$state" = "failed" ] ; then
                logger -p local0.notice "$MON_TMPL_NAME: interface $i is back UP (status: $status)"
                for f in $interfaces ; do
                    logger -p local0.notice "$MON_TMPL_NAME: bring UP interface $f in group ($interfaces)"
                    b interface $f enable
                done
                echo "ok" > /tmp/int_fail_state
            fi
        fi
    done
    # All specified interfaces are up...
    echo "up"
    exit 0

    I have an active F5a and a standby F5b, and I need to set up interface failsafe between them. Once I applied it on F5a (the active unit), while interfaces 1/2.1 and 2/2.2 were still up, F5a changed to standby and the other F5 didn't go active. I have another interface failsafe monitor for interfaces 1/2.1 1/2.3 2/2.1. Please help me understand this script very clearly so I can find the reason F5a went to standby. Thanks.