DNS-based failover between AWS Availability Zones and Split DNS

Working in the AWS public cloud, one has to adapt to a world of guaranteed failure at unpredictable times. Combining LTM for HA within a single availability zone with GTM across availability zones and regions provides an architecture that can survive the chaos monkeys.

The following is part of a demo environment that I built for AWS; it highlights a couple of useful features of LTM/GTM, including:

  • Monitoring LTM services from GTM
  • Using GTM for outbound DNS resolution
  • Creating split DNS records for EIP vs. internal VPC IP
  • Creating topology LB records to avoid cross-AZ communication for active-active services

Demo

Here’s what the overall architecture looks like:

We create a few wide IP records to show different failure scenarios (these DNS records are isolated to my demo environment; a resolution sketch follows the list):

  • prefer-d.f5demo.com
    • Active/Standby from D to E
  • prefer-e.f5demo.com
    • Active/Standby from E to D
  • active-active.f5demo.com
    • Round robin between D and E (unless the request is internal)
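
If you'd rather watch these records from a script than a browser, here's a minimal sketch using the dnspython package; the GTM listener address and query count are assumptions for illustration:

    # Minimal sketch (assumes dnspython: pip install dnspython).
    # The listener address below is a placeholder for your GTM listener.
    import dns.resolver

    GTM_LISTENER = "203.0.113.10"  # placeholder address
    WIDE_IPS = [
        "prefer-d.f5demo.com",
        "prefer-e.f5demo.com",
        "active-active.f5demo.com",
    ]

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [GTM_LISTENER]

    for name in WIDE_IPS:
        # Query a few times: the prefer-* records should always return the
        # same AZ, while active-active should alternate between D and E.
        answers = set()
        for _ in range(4):
            for rr in resolver.resolve(name, "A"):
                answers.add(rr.address)
        print(name, sorted(answers))

Against a healthy environment you'd expect a single address for each prefer-* name and both AZ addresses for active-active.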

Here’s what things look like for an external user when everything is healthy:

Or the same view from the command line with curl:
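
In script form, the curl loop boils down to something like this; it assumes your client resolves through the GTM as described and that each demo web server identifies itself in its response body:

    # Hypothetical stand-in for the curl loop. Assumes the client's resolver
    # points at the GTM listener and the backends return an identifying body.
    import urllib.request

    for _ in range(6):
        with urllib.request.urlopen("http://active-active.f5demo.com/", timeout=3) as resp:
            print(resp.read().decode().strip())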

Taking a look from an internal user, we can see that the behavior is slightly different. In this case a request from the US-EAST-1D AZ is always directed to the same AZ when communicating with the active-active service, and it reaches the service via the internal VPC IP address rather than the EIP. For very data-heavy services this can provide cost savings by avoiding cross-AZ data transfer charges.
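
A simple way to confirm the split DNS behavior is to compare the answer an internal client gets from the VPC listener with the answer an external resolver hands out; both listener addresses below are placeholders:

    # Split DNS check (assumes dnspython). Addresses are placeholders:
    # INTERNAL_LISTENER is the DNS cache listener inside the VPC,
    # EXTERNAL_RESOLVER is the externally reachable GTM listener.
    import dns.resolver

    def lookup(name, nameserver):
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [nameserver]
        return [rr.address for rr in r.resolve(name, "A")]

    INTERNAL_LISTENER = "10.0.1.10"
    EXTERNAL_RESOLVER = "203.0.113.10"

    name = "active-active.f5demo.com"
    print("internal view:", lookup(name, INTERNAL_LISTENER))  # expect internal VPC IP, same AZ
    print("external view:", lookup(name, EXTERNAL_RESOLVER))  # expect EIPs, round robin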

Triggering a failure of the D services (stopping the web server), we can see that connections initially fail while the client is still trying to reach the service at the US-EAST-1D IP address:

Once the client refreshes its DNS record (30-second default TTL), we can see that it is now communicating only with US-EAST-1E services:
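
You can reproduce that client-side behavior with a small loop that caches the answer for its TTL the way a stub resolver would; a sketch, again assuming dnspython and a placeholder listener address:

    # Failover watch: re-resolve only when the cached answer's TTL expires,
    # mimicking a client's stub resolver, and probe the service over HTTP.
    import time
    import urllib.request
    import dns.resolver

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["203.0.113.10"]  # placeholder GTM listener

    cached_ip, expires = None, 0.0
    for _ in range(12):
        now = time.time()
        if now >= expires:  # honor the record's TTL (30 seconds in the demo)
            answer = resolver.resolve("prefer-d.f5demo.com", "A")
            cached_ip, expires = answer[0].address, now + answer.rrset.ttl
        try:
            urllib.request.urlopen("http://%s/" % cached_ip, timeout=2)
            status = "OK"
        except OSError as exc:
            status = "FAILED (%s)" % exc
        print(time.strftime("%H:%M:%S"), cached_ip, status)
        time.sleep(5)

Kill the D web server partway through and you should see FAILED lines until the TTL expires, after which the US-EAST-1E address takes over.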

Setting it up

To create this demo you’ll need:

  • LTM/GTM devices
  • Two AZs
  • Some backend services

Once you have these, you can build out a standard LTM/GTM environment in AWS. Create a DNS cache (required for topology LB). Point your AWS instances at your DNS cache listeners (make sure these are accessible only to your internal clients!). Then build up your split DNS and topology records.
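
As a sanity check on that "internal clients only" warning, a query like the one below, run from both an internal instance and an external host, should only succeed from inside the VPC (the listener address is a placeholder):

    # Listener reachability check (assumes dnspython). Run from inside and
    # outside the VPC; only the internal run should print an answer.
    import dns.exception
    import dns.resolver

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["10.0.1.10"]  # placeholder DNS cache listener
    resolver.lifetime = 3  # give up quickly if the listener is unreachable

    try:
        answer = resolver.resolve("active-active.f5demo.com", "A")
        print("listener reachable:", [rr.address for rr in answer])
    except (dns.exception.Timeout, dns.resolver.NoNameservers):
        print("listener unreachable from here (expected for external hosts)")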

You can extend this example to go cross-region as well (with the limitation that your internal IP space would not be reachable across VPCs)!

More Chaos

The example above illustrates a single failure scenario (loss of US-EAST-1D web services). You can imagine several other scenarios that could cause a failure, including but not limited to:

  • Loss of US-EAST-1E services
  • Loss of a single AZ
  • Loss of a single LTM/GTM device

The demo environment is built to survive these and keep on running despite the best efforts of any chaos monkeys.

Published May 22, 2015