Using VPC Endpoints with Cloud Failover Extension

Introduction

Have you heard of the new F5 Cloud Failover Extension (CFE)? If you haven’t, I encourage you to go read about this new feature. CFE is an iControl LX extension that provides L3 failover functionality in cloud environments, effectively replacing Gratuitous ARP. CFE supports TMOS 14.1.x and later. It provides some great benefits, such as standardized failover patterns across all clouds, portability and, very importantly, lifecycle supportability, which means you can upgrade your BIG-IPs without having to call F5 support to fix failover. The CFE works well and is pretty fast by cloud failover standards (remember, we are using APIs), but it has a sticky requirement: it needs to access the Amazon APIs, and this generally means access to the internet via an EIP or NAT Gateway. For most customers this is ok, but for my customer it was a deal breaker.

The Requirement

Deploy traditional active/standby failover with the Cloud Failover Extension in an environment that cannot use Elastic IPs or a NAT Gateway. By the way, if you are interested, the fine F5 Cloud Architect Michael O’Leary has a write-up on deploying BIG-IP in AWS without EIPs. It will give some context on what the addressing and routing paths may look like for this scenario.

You might be thinking “what a weird request”, but in the DoD or Federal space this is a common use case. Customers may sit in closed networks or behind a CAP (Cloud Access Point), and the only connection from the CSP is a direct connect to a base somewhere in the world.

Testing

Let me first say that I am not a CFE expert. If you want to dig into the source code, I encourage you to do so. But I will offer my testing observations, and here they are:

When deploying with EIPs, the failover behavior was that the EIP tied to the VIP moved to the new Active BIG-IP, much like you would expect a floating IP to do... but remember, this is the cloud, we don’t have traditional failover.
When I removed the EIPs and set up a NAT Gateway, the CFE would move the secondary private IP from the former Active to the new Active BIG-IP. A bit different behavior, but failover still worked.
Finally, failover would not work if neither the EIPs nor a NAT Gateway were available to allow access to the public Amazon APIs.

So how do you allow access to the Amazon APIs without EIPs or a NAT Gateway?

VPC Endpoints to the rescue!

The Setup

My AWS environment consisted of a single VPC living in AWS GovCloud. I had a single route table with three subnets, and three security groups with the proper access configured. In addition, I configured two VPC Endpoints.

My BIG-IPs were deployed using a CloudFormation template from the official F5 GitHub. This was a 3-NIC active/standby API failover template. I recommend using the templates if deploying to a greenfield because everything is configured for you: EIPs if you choose, all of the cloud libs including the CFE, other goodies like service discovery, and all of the tagging for CFE. However, if you have a brownfield deployment and wish to install Cloud Failover Extension, then just visit the site and follow the installation instructions.

In addition to my BIG-IPs, I have a Windows client and an NGINX web server for failover testing. The Windows client sits on the external subnet and the NGINX server on the internal subnet.

After running the JSON template in AWS CloudFormation, my BIG-IPs were up and running and already in active/standby with all of the prereqs loaded. Disclaimer: if you choose not to enable EIPs at first launch, the cloud libs will not install, because they need access to the internet to reach the GitHub repos. I would recommend launching with the EIPs first and then removing them after the BIG-IPs have fully booted.

We are going to assume that you have already changed the password and removed any public EIPs from the BIG-IPs. If you are not using a jump box to access the management interfaces, then you will need to keep public EIPs on the management interfaces for access.

Ok, let’s get started

VPC Endpoints

This is what makes the magic work. A VPC endpoint enables you to privately connect your VPC to supported AWS services and VPC endpoint services without requiring an internet gateway, NAT device, VPN connection, or AWS Direct Connect. Instances do not require any public IP addresses and traffic never leaves the Amazon network.

We will need to create two endpoints: an S3 endpoint and an EC2 endpoint. S3 is needed because the CFE uses a bucket to store state and credentials. EC2 is needed to allow updates to the route tables and ENI IP assignments based on the current state. An important note here: the EC2 endpoint relies on DNS, so we will need to enable private DNS names on it later.

Create the S3 Endpoint

You will need to go to the VPC section of the AWS console and click on Endpoints on the middle left and click Create Endpoint. Service Category is AWS Services and then select Service Name S3. It will look very similar to com.amazonaws.us-gov-west-1.s3, depending on your region.

Next, select your VPC and then select the route table you want to associate the S3 endpoint with. Leave the Policy as Full Access unless you have a requirement for a Custom policy. If everything looks ok, then click Create Endpoint and close. It may take a moment to become available.
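If you would rather script this step than click through the console, here is a minimal sketch using boto3. It is not part of my original setup: it assumes boto3 is installed and AWS credentials are configured, and the VPC and route table IDs in the example call are placeholders.

```python
def s3_gateway_endpoint_params(region, vpc_id, route_table_ids):
    """Build the arguments for the EC2 create_vpc_endpoint call (gateway type)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # Same naming pattern the console shows, e.g. com.amazonaws.us-gov-west-1.s3
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Gateway endpoints are associated with route tables, not subnets.
        "RouteTableIds": list(route_table_ids),
        # No PolicyDocument is passed, which leaves the default full-access policy.
    }


def create_s3_gateway_endpoint(region, vpc_id, route_table_ids):
    """Create the endpoint and return its ID. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    params = s3_gateway_endpoint_params(region, vpc_id, route_table_ids)
    return ec2.create_vpc_endpoint(**params)["VpcEndpoint"]["VpcEndpointId"]


# Placeholder IDs for illustration only; this call does not touch AWS.
example = s3_gateway_endpoint_params(
    "us-gov-west-1", "vpc-0123456789abcdef0", ["rtb-0123456789abcdef0"]
)
print(example["ServiceName"])
```

Calling create_s3_gateway_endpoint with your real IDs performs the same action as the console steps above.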

Now let’s take a look at the route table. As we can see, the endpoint added a prefix list of IP space to the route table. It does this because the S3 endpoint is a gateway endpoint. When we create the EC2 endpoint next, it will not add a route table entry because it is an interface endpoint.
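The same check can be done against the output of the EC2 describe_route_tables API call. The sketch below only parses one route table entry from that response; the sample data and IDs are made up for illustration.

```python
def gateway_endpoint_routes(route_table):
    """Return (prefix_list_id, endpoint_id) pairs found in one RouteTables entry."""
    pairs = []
    for route in route_table.get("Routes", []):
        prefix_list = route.get("DestinationPrefixListId")
        target = route.get("GatewayId", "")
        # Gateway endpoint routes point a prefix list at a vpce-... target.
        if prefix_list and target.startswith("vpce-"):
            pairs.append((prefix_list, target))
    return pairs


# Hypothetical route table entry shaped like a describe_route_tables result.
sample_route_table = {
    "RouteTableId": "rtb-0123456789abcdef0",
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationPrefixListId": "pl-0a1b2c3d", "GatewayId": "vpce-0123456789abcdef0"},
    ],
}
print(gateway_endpoint_routes(sample_route_table))
```

An empty list for your route table would suggest the S3 endpoint was not associated with it.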

Create the EC2 Endpoint

Go back to Endpoints and let’s create another endpoint. Leave AWS services selected and then find the EC2 service name. Depending on your region, it will look similar to com.amazonaws.us-gov-west-1.ec2. Now select your VPC and then choose the Availability Zone and subnet you want to put the endpoint in. Remember, this is an interface endpoint, so you can put it anywhere as long as it’s reachable. In my example I am putting it in my internal subnet. VERY IMPORTANT: make sure you check Enable DNS name for this endpoint. CFE reaches the EC2 API by name, so ec2.%region%.amazonaws.com must resolve to the private A record of the endpoint, not the public IPs. I made this mistake and it would not work… don’t make the same mistake.

Leave the default Full Access policy and click Create Endpoint.
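For a scripted version of this step, here is a hedged boto3 sketch. As before, boto3 and credentials are assumed, and the subnet and security group IDs are placeholders. The key detail is PrivateDnsEnabled, which corresponds to the Enable DNS name checkbox.

```python
def ec2_interface_endpoint_params(region, vpc_id, subnet_ids, security_group_ids):
    """Build the arguments for the EC2 create_vpc_endpoint call (interface type)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.ec2",
        # Interface endpoints place an ENI in each subnet listed here.
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Equivalent of the "Enable DNS name" checkbox in the console.
        "PrivateDnsEnabled": True,
    }


def create_ec2_interface_endpoint(region, vpc_id, subnet_ids, security_group_ids):
    """Create the endpoint and return its ID. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    params = ec2_interface_endpoint_params(
        region, vpc_id, subnet_ids, security_group_ids
    )
    return ec2.create_vpc_endpoint(**params)["VpcEndpoint"]["VpcEndpointId"]


# Placeholder IDs for illustration only; this call does not touch AWS.
example = ec2_interface_endpoint_params(
    "us-gov-west-1", "vpc-0123456789abcdef0",
    ["subnet-0123456789abcdef0"], ["sg-0123456789abcdef0"],
)
print(example["ServiceName"], example["PrivateDnsEnabled"])
```

Private DNS on an interface endpoint requires DNS support and DNS hostnames to be enabled on the VPC, and the security group attached to the endpoint needs to allow HTTPS from the BIG-IPs.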

Let’s take a closer look at the EC2 endpoint. Click Subnets and view the IPv4 Addresses.

As you can see, the IP lives in the subnet you selected and is the DNS entry point for ec2.%region%.amazonaws.com. This FQDN is what Cloud Failover Extension queries when updating network objects in AWS. Let’s run a test from one of the BIG-IPs.

Run: dig ec2.us-gov-west-1.amazonaws.com, replacing the region with your own. It should return the private A record for the FQDN, which is the IP of the EC2 endpoint you just created.

Ok, if you got a good private A record, then we are done with endpoints. If not, troubleshoot before moving on.
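If you want to automate the same check, the sketch below resolves the FQDN and verifies that every answer is a private address. It assumes Python 3 on a host in the VPC that uses the same resolver as the BIG-IPs; dig on the BIG-IP itself remains the simplest test.

```python
import ipaddress
import socket


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses the resolver hands back."""
    infos = socket.getaddrinfo(hostname, 443, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def all_private(addresses):
    """True only if there is at least one address and all of them are private."""
    addresses = list(addresses)
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )


def endpoint_dns_ok(region):
    """Resolve the regional EC2 API name and check for private answers."""
    return all_private(resolve_ipv4(f"ec2.{region}.amazonaws.com"))
```

Calling endpoint_dns_ok("us-gov-west-1") from inside the VPC should return True once private DNS is enabled on the EC2 endpoint. A False result means the name still resolves to public addresses.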

Modify the CFE Declaration and Tag Route Table

Our last step is to modify the CFE declaration and tag the route table. CFE is configured through a declarative API, meaning we can’t use the GUI to configure it. We need to use the command line or an application like Postman. Installing and configuring Postman for passing basic auth tokens is out of scope for this document, but it’s not difficult and is well documented.

Let’s first run a GET to our management interfaces to see what is currently configured. This is documented in the CFE Quickstart section.

The response should show what is currently configured. Check the defaultNextHopAddresses; they should be the external self IPs of your BIG-IPs. If they match, then we only need to modify the scopingAddressRanges, which needs to be your VIP IP space. In my case that is the same subnet my external self IPs live on, 10.0.2.0/24. Here is my modified JSON declaration. Take note of the tag “cfe-failover-active-standby”; you will need your tag value when you update the route table tag.
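My original declaration is not reproduced in this text, so what follows is only a sketch of its general shape, built as a Python dict so it can be serialized and posted. The tag key f5_cloud_failover_label and the two next-hop addresses are assumptions for illustration; use the key and self IPs from your own GET response.

```python
import json


def build_declaration(label, vip_cidr, next_hops):
    """Assemble a CFE declaration for AWS with static next-hop addresses."""
    scoping_tags = {"f5_cloud_failover_label": label}  # assumed tag key
    return {
        "class": "Cloud_Failover",
        "environment": "aws",
        "externalStorage": {"scopingTags": dict(scoping_tags)},
        "failoverAddresses": {"scopingTags": dict(scoping_tags)},
        "failoverRoutes": {
            "scopingTags": dict(scoping_tags),
            # One entry per range; here just the VIP subnet.
            "scopingAddressRanges": [{"range": vip_cidr}],
            "defaultNextHopAddresses": {
                "discoveryType": "static",
                "items": list(next_hops),
            },
        },
    }


declaration = build_declaration(
    "cfe-failover-active-standby",
    "10.0.2.0/24",
    ["10.0.2.11", "10.0.2.12"],  # hypothetical external self IPs
)
print(json.dumps(declaration, indent=2))
```

Note that failoverAddresses and failoverRoutes sit side by side at the top level of the declaration, and each range gets its own entry in scopingAddressRanges.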

Now let’s POST the updated declaration. You should receive a 200 response, and the body should show your new updates. Run a GET on your other BIG-IP; it should show the same data.
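If you prefer a script to Postman, the GET and POST calls can be sketched with the Python standard library. This assumes basic auth against the management address and the CFE declare endpoint at /mgmt/shared/cloud-failover/declare; the host and credentials are placeholders you supply.

```python
import base64
import json
import ssl
import urllib.request

CFE_DECLARE_PATH = "/mgmt/shared/cloud-failover/declare"


def build_request(host, user, password, declaration=None):
    """Build a GET request, or a POST when a declaration dict is supplied."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    data = None if declaration is None else json.dumps(declaration).encode()
    req = urllib.request.Request(
        f"https://{host}{CFE_DECLARE_PATH}",
        data=data,
        method="GET" if data is None else "POST",
    )
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Content-Type", "application/json")
    return req


def send(req, verify_tls=True):
    """Send the request and return (status, parsed JSON body)."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Lab use only: skips certificate checks for self-signed management certs.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=context, timeout=30) as resp:
        return resp.status, json.loads(resp.read().decode())
```

For example, send(build_request(host, user, password)) reads the current declaration, and passing a declaration dict as the fourth argument to build_request posts it instead.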

Let’s update our route table. Go to VPC > Route Tables and find your route table. Then select Tags and add a tag that matches your labels. It is very important that the tag matches everything else in your environment. In my case, cfe-failover-active-standby is the value that is shown in my declaration and associated with my interfaces.
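The tagging step can also be scripted. This is a minimal boto3 sketch under the same assumptions as before (boto3 installed, credentials configured). The tag key f5_cloud_failover_label is an assumption; use whatever key your declaration's scopingTags use. The route table ID is a placeholder.

```python
def route_table_tag_params(route_table_id, label, tag_key="f5_cloud_failover_label"):
    """Build the arguments for the EC2 create_tags call."""
    return {
        "Resources": [route_table_id],
        # The key and value must match the scopingTags in the CFE declaration.
        "Tags": [{"Key": tag_key, "Value": label}],
    }


def tag_route_table(region, route_table_id, label):
    """Apply the tag. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    ec2.create_tags(**route_table_tag_params(route_table_id, label))


# Placeholder route table ID for illustration only; this call does not touch AWS.
print(route_table_tag_params("rtb-0123456789abcdef0", "cfe-failover-active-standby"))
```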

This completes the configuration, let’s test!

If you have a client and server configured, you can use these for testing after failover.

Log into your Active and Standby BIG-IPs and take note of the virtual server statistics. Also take note of the secondary private IP on the Active BIG-IP; this IP should move over to the new Active BIG-IP when failover is initiated.
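One way to watch the secondary IP from the AWS side is to compare the output of the EC2 describe_network_interfaces call before and after failover. The sketch below only parses that response shape; the sample data and IDs are invented for illustration.

```python
def secondary_private_ips(describe_response):
    """Map each ENI ID to the list of its non-primary private IPs."""
    result = {}
    for eni in describe_response.get("NetworkInterfaces", []):
        result[eni["NetworkInterfaceId"]] = [
            addr["PrivateIpAddress"]
            for addr in eni.get("PrivateIpAddresses", [])
            if not addr.get("Primary", False)
        ]
    return result


def owner_of(ip, describe_response):
    """Return the ENI ID currently holding the given secondary IP, or None."""
    for eni_id, ips in secondary_private_ips(describe_response).items():
        if ip in ips:
            return eni_id
    return None


# Hypothetical response: the VIP 10.0.2.100 sits on the first BIG-IP's ENI.
sample = {
    "NetworkInterfaces": [
        {
            "NetworkInterfaceId": "eni-aaaa",
            "PrivateIpAddresses": [
                {"PrivateIpAddress": "10.0.2.11", "Primary": True},
                {"PrivateIpAddress": "10.0.2.100", "Primary": False},
            ],
        },
        {
            "NetworkInterfaceId": "eni-bbbb",
            "PrivateIpAddresses": [
                {"PrivateIpAddress": "10.0.2.12", "Primary": True},
            ],
        },
    ]
}
print(owner_of("10.0.2.100", sample))
```

After a successful failover, the same lookup against a fresh response should return the other BIG-IP's ENI.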

Go into your Active BIG-IP and force it to standby. How fast failover occurs depends on how busy the AWS APIs are. After failover, test your application to see if traffic is hitting the new Active BIG-IP.

You can follow along by logging into the CLI and tailing the restnoded log, which records the CFE failover messages.

Run this command from the CLI: tail -f /var/log/restnoded/restnoded.log
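The restnoded log is noisy, so it can help to filter for CFE entries only. Here is a small sketch that picks out the [f5-cloud-failover] lines; the line format is assumed from sample restnoded output and may differ between versions.

```python
import re

# Assumed format: "<timestamp> - <level>: [f5-cloud-failover] <message>"
CFE_LINE = re.compile(
    r"^(?P<time>.+?) - (?P<level>\w+): \[f5-cloud-failover\] (?P<message>.*)$"
)


def cfe_entries(lines):
    """Yield (level, message) for each CFE log line; skip everything else."""
    for line in lines:
        match = CFE_LINE.match(line.strip())
        if match:
            yield match.group("level"), match.group("message")


SAMPLE_LINES = [
    "Mon, 04 May 2020 21:04:17 GMT - info: [f5-cloud-failover] Performing failover - execute",
    "    at tryCatcher (/usr/share/rest/node/node_modules/bluebird/js/release/util.js:16:23)",
]
for level, message in cfe_entries(SAMPLE_LINES):
    print(level, message)
```

To use it live, pipe the tail output into a script that passes sys.stdin to cfe_entries.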

This completes using VPC Endpoints with Cloud Failover Extension.

Published Apr 27, 2020

Version 1.0

Noof

Employee

Solutions Architect currently covering the public sector.

4 Comments

TJ_Vreugdenhil

Cirrus

May 04, 2020

Hi Noof - I followed the whole procedure, but I am getting a "Recovery operations are empty" error. Any recommendations?

[root@ip-10-10-8-28:Standby:In Sync] config # tail -f /var/log/restnoded/restnoded.log
Mon, 04 May 2020 21:04:17 GMT - info: [f5-cloud-failover] Performing failover - execute
Mon, 04 May 2020 21:04:17 GMT - warning: [f5-cloud-failover] Performing Failover - recovery
Mon, 04 May 2020 21:04:17 GMT - severe: [f5-cloud-failover] Recovery operations are empty, advise reset via the API Error: Recovery operations are empty, advise reset via the API
    at FailoverClient._getRecoveryOperations (/var/config/rest/iapps/f5-cloud-failover/nodejs/failover.js:373:19)
    at _getDeviceObjects.then.then.then (/var/config/rest/iapps/f5-cloud-failover/nodejs/failover.js:124:33)
    at tryCatcher (/usr/share/rest/node/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/usr/share/rest/node/node_modules/bluebird/js/release/promise.js:512:31)
    at Promise._settlePromise (/usr/share/rest/node/node_modules/bluebird/js/release/promise.js:569:18)
    at Promise._settlePromise0 (/usr/share/rest/node/node_modules/bluebird/js/release/promise.js:614:10)
    at Promise._settlePromises (/usr/share/rest/node/node_modules/bluebird/js/release/promise.js:693:18)
    at Async._drainQueue (/usr/share/rest/node/node_modules/bluebird/js/release/async.js:133:16)
    at Async._drainQueues (/usr/share/rest/node/node_modules/bluebird/js/release/async.js:143:10)
    at Immediate.Async.drainQueues (/usr/share/rest/node/node_modules/bluebird/js/release/async.js:17:14)
    at runCallback (timers.js:794:20)
    at tryOnImmediate (timers.js:752:5)
    at processImmediate [as _immediateCallback] (timers.js:729:5)

The following JSON declaration is successful from both F51 and F52.

{
    "class": "Cloud_Failover",
    "environment": "aws",
    "externalStorage": {
        "scopingTags": {
            "f5_cloud_failover_label": "bigip-nonprod"
        }
    },
    "failoverAddresses": {
        "enabled": true,
        "scopingTags": {
            "f5_cloud_failover_nic_map_eth1": "NonProd-eth1-external",
            "f5_cloud_failover_nic_map_eth2": "NonProd-eth2-internal",
            "f5_cloud_failover_nic_map_eth3": "NonProd-eth3-internal2"
        },
        "failoverRoutes": {
            "enabled": true,
            "scopingTags": {
                "f5_cloud_failover_label": "bigip-nonprod-prod"
            },
            "scopingAddressRanges": [
                {
                    "range": "10.10.116.0/24, 10.10.117.0/24, 10.200.116.0/24, "
                }
            ],
            "defaultNextHopAddresses": {
                "discoveryType": "static",
                "items": [
                    "10.200.116.105",
                    "10.200.116.116",
                    "10.10.116.88",
                    "10.10.116.56",
                    "10.10.117.94",
                    "10.10.117.232"
                ]
            }
        },
        "controls": {
            "class": "Controls",
            "logLevel": "silly"
        }
    }
}

And the dig does return a valid internal IP of the EC2 endpoint:

[root@ip-10-10-8-10:Standby:In Sync] config # dig ec2.us-east-2.amazonaws.com
hmac_link.c:350: FIPS mode is 1: MD5 is only supported if the value is 0.
Please disable either FIPS mode or MD5.

; <<>> DiG 9.11.8 <<>> ec2.us-east-2.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63858
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;ec2.us-east-2.amazonaws.com.  IN   A

;; ANSWER SECTION:
ec2.us-east-2.amazonaws.com. 60 IN   A    10.200.116.158

;; Query time: 3 msec
;; SERVER: 10.10.112.2#53(10.10.112.2)
;; WHEN: Mon May 04 16:05:47 CDT 2020
;; MSG SIZE rcvd: 61

Noof
Employee
May 05, 2020
Hello TJ,

You may want to open an issue at Github. I have yet to see this error.
https://github.com/F5Networks/f5-cloud-failover-extension/issues

What does your BIG-IP setup look like? HA across net or same net?
Did you setup the S3 gateway endpoint?
It is not happy about the recovery being empty. Do you have state in your S3 bucket? It should have backup folder and others.
Error: Recovery operations are empty, advise reset via the API
TJ_Vreugdenhil
Cirrus
May 05, 2020
Hey - Thanks
I just opened an issue at Github.
The setup is an F5 HA pair in the same AZ. HA ConfigSync/failover is working over the internal NIC. However, there are 2 NICs for internal and 1 NIC for external, all acting as reverse proxies, so all three NICs are added to the declaration. Yes, I created both the S3 gateway endpoint and the EC2 endpoint as indicated in this article. I will look at the S3 bucket. It does have the proper tag matching the declaration. Here is the "f5cloudfailoverstate.json" in the F5 S3 bucket:
{"taskState":"FAILED","message":"Failover failed because Recovery operations are empty, advise reset via the API","timestamp":"2020-05-05T17:03:39.568Z","instance":"ip-10-10-8-10.us-east-2.compute.internal","failoverOperations":{"addresses":null,"routes":null}}
Jeff_Giroux_F5
Ret. Employee
Jun 19, 2020
Sounds like you need to POST to /reset to reset state file

https://clouddocs.f5.com/products/extensions/f5-cloud-failover/latest/userguide/troubleshooting.html#i-m-receiving-a-recovery-operations-are-empty-error-when-failover-is-triggered
