In this article we look at how to minimize dropped connections during a planned failover in an Active/Stand-by setup using the Google Cloud Load Balancer (GCLB) in front of BIG-IP. This is a companion article to the video demo on DevCentral YouTube (linked below).
You are doing your job when you fail.
You are successful at your job when you fail fast and recover.
BIG-IP can use gratuitous ARP to trigger a very fast device failover in an active/stand-by (A/S) configuration in an on-premises environment. In a cloud environment there is no Layer 2 environment to support ARP. Instead, the BIG-IP must rely on DNS, API calls, or a “native” load balancer to trigger failover. As a result, failover time can range from seconds to minutes depending on the failover mechanism.
It may seem odd, but load balancing load balancers is not uncommon. Two tiers can provide an architecture that is easier to maintain and more flexible. In this scenario the Google Cloud Load Balancer acts as an L4 TCP proxy that brings traffic to the BIG-IP, which applies L7 HTTP policies (e.g. WAF protection).
Getting started in the cloud can be challenging. F5 provides Google Deployment Manager (GDM) Templates to expedite your launch into the cloud. We will focus on the “via-lb” template, which takes care of setting up a pair of BIG-IP in an A/S configuration with GCLB in front. It can also be used for an A/A configuration; this is described later in the article. There is also a “via-api” method that does not use GCLB, which can be useful when you need direct access to the packet and/or a non-TCP based service. For more information about deploying in GCP with BIG-IP please visit: https://clouddocs.f5.com/cloud/public/v1/google_index.html
After deploying the “via-lb” GDM template you will have a scenario where the GCLB is sending traffic to the active BIG-IP.
The template also creates two virtual servers: one for data-plane traffic (port 80) and one for monitoring (port 40000).
When a failover is triggered, the BIG-IP that was previously active begins to drop connections immediately. This is because both the data-plane and monitoring virtual servers default to dropping connections when a BIG-IP goes into stand-by mode. Since the GCLB considers the device that was previously “stand-by” to be down (its health checks were failing while in stand-by mode), it has to wait until it detects that device as “healthy” before sending traffic to it.
Depending on how the GCLB health check thresholds are configured, this could take 10-15 seconds. This isn’t terrible (it is acceptable for most basic use cases), but it is bad if you have an application that is sensitive to dropped connections.
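To get a feel for where a number like that comes from, here is a rough back-of-the-envelope sketch. It assumes the load balancer probes at a fixed interval and marks a backend healthy after a number of consecutive successful probes; the function name and the values passed in are illustrative, not the template's actual settings.

```python
def detection_window(check_interval_sec, healthy_threshold):
    """Return (best_case, worst_case) seconds until a newly passing
    backend is marked healthy.

    Assumes evenly spaced probes and ignores probe response time.
    """
    # Best case: the device starts passing just before a probe fires,
    # so only (threshold - 1) further intervals are needed.
    best = check_interval_sec * (healthy_threshold - 1)
    # Worst case: the device starts passing just after a probe failed,
    # so a full interval elapses before the first success.
    worst = check_interval_sec * healthy_threshold
    return best, worst


# Illustrative values only (hypothetical, not taken from the template).
print(detection_window(check_interval_sec=5, healthy_threshold=3))  # (10, 15)
```

Shortening the interval or lowering the threshold narrows the window, at the cost of more probe traffic and a greater chance of flapping.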
To make failover occur faster we can modify the settings of the BIG-IP. The two changes are:
The first change is to set the traffic group of the virtual server’s virtual-address to NONE, which causes the BIG-IP to always respond to traffic. This is not safe to do in an on-premises environment (it will create IP address conflicts), but is OK in a cloud environment where the networking is “owned” by the cloud provider. This allows the BIG-IP to still accept connections when it is in stand-by mode.
When you change the virtual-address to traffic group NONE this creates an Active/Active (A/A) configuration. Both the data-plane and health monitor will always respond to traffic. In order to make it act as an A/S configuration you need to make an additional change.
The second change is to create a separate health monitor that will track the state of the BIG-IP. This will allow GCLB to detect when a device has gone into stand-by mode and stop sending traffic to that device.
One method of configuring the health monitor is to create a new health monitor on a “bogus” virtual server (e.g. 255.255.255.254) that is on the default “traffic-group-1”. The existing monitor can be replaced with a new monitor that uses an iRule to target the “bogus” virtual. The result is that the health monitor will go “down” when the BIG-IP is in stand-by, but the “real” virtual server will still accept traffic.
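The combined effect of the two changes can be sketched as a toy model. This is not BIG-IP configuration; it only models the rule described above, that an address in “traffic-group-1” is served by the active unit alone while an address in traffic group NONE is served by both. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BigIp:
    name: str
    active: bool


def answers(device, traffic_group):
    """Model: traffic-group-1 follows the active unit; 'none' always answers."""
    if traffic_group == "traffic-group-1":
        return device.active
    return True


def gclb_view(device, data_tg, monitor_tg):
    """What the load balancer would observe for one device."""
    return {
        "health_check_passes": answers(device, monitor_tg),
        "accepts_data": answers(device, data_tg),
    }


standby = BigIp("bigip2", active=False)

# Default template: both virtual servers follow traffic-group-1.
print(gclb_view(standby, "traffic-group-1", "traffic-group-1"))
# Modified: data virtual-address in NONE, monitor tied to traffic-group-1.
print(gclb_view(standby, "none", "traffic-group-1"))
```

In the modified case the stand-by unit reports unhealthy but still accepts data, so connections that reach it while GCLB is catching up on health state are served rather than dropped.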
This failover can be seen in the demo video where you can see both the “Bad” and the “Good” failover occurring simultaneously, and you can compare the results (spoiler alert: Good is better).
To help visualize the failover, I created a Python script that displays “All Good” while the service is up and “Trouble!” when a failure is detected. In the video you can see that the “Bad” failover drops connections for ~20 seconds. The “Good” failover does not drop any connections. The script tests for failures by establishing a new connection once a second and timing out if the connection takes longer than 1 second to be created.
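The script from the video is not reproduced here, but a minimal sketch of the same idea might look like the following. It assumes the check is an HTTP GET of `/` over a fresh connection with a 1 second timeout; the function names and the `count` parameter are my own additions for illustration.

```python
import http.client
import time


def probe(host, port=80, timeout=1.0):
    """Open a new connection and issue one GET; True if a response arrives in time."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        conn.getresponse()
        return True
    except (OSError, http.client.HTTPException):
        # Covers refused/reset connections, timeouts, and malformed replies.
        return False
    finally:
        conn.close()


def status(ok):
    return "All Good" if ok else "Trouble!"


def monitor(host, port=80, interval=1.0, count=None):
    """Probe once per interval; runs forever unless count is given."""
    n = 0
    while count is None or n < count:
        print(time.strftime("%H:%M:%S"), status(probe(host, port)))
        time.sleep(interval)
        n += 1
```

Calling `monitor("203.0.113.10")` (a placeholder address standing in for the GCLB front end) prints one status line per second, which makes the length of any outage easy to read off the timestamps.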
Bad Failover: 20 seconds of dropped connections
Good Failover: No dropped connections
This article featured Google Cloud Platform, but the same methodology can be applied in other cloud environments as well. It is not always necessary to use A/S or it may be OK to drop connections for a brief period of time; but in case it isn’t, you now know how to fail faster in the cloud.