DNS-based failover between AWS Availability Zones and Split DNS

Working in the AWS public cloud, one has to adapt to a world of guaranteed failure at unpredictable times. Combining LTM for HA within a single availability zone with GTM across availability zones and regions provides an architecture that can survive the chaos monkeys.

The following is part of a demo environment that I built for AWS; it highlights a couple of useful features of LTM/GTM, including:

  • Monitoring LTM services from GTM
  • Using GTM for outbound DNS resolution
  • Creating split DNS records for EIP vs. internal VPC IP
  • Creating topology LB records to avoid cross-AZ communication for active-active services

Demo

Here’s what the overall architecture looks like:

We create a few wide IP records to show different failure scenarios (these DNS records are isolated to my demo environment; a resolution sketch follows the list):

  • prefer-d.f5demo.com
    • Active/Standby from D to E
  • prefer-e.f5demo.com
    • Active/Standby from E to D
  • active-active.f5demo.com
    • Round robin between D and E (unless the request is internal)
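
If you'd rather watch these records from a script than a browser, here's a minimal sketch using the dnspython package; the GTM listener address and query count are assumptions for illustration:

    # Minimal sketch (assumes dnspython: pip install dnspython).
    # The listener address below is a placeholder for your GTM listener.
    import dns.resolver

    GTM_LISTENER = "203.0.113.10"  # placeholder address
    WIDE_IPS = [
        "prefer-d.f5demo.com",
        "prefer-e.f5demo.com",
        "active-active.f5demo.com",
    ]

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [GTM_LISTENER]

    for name in WIDE_IPS:
        # Query a few times: the prefer-* records should always return the
        # same AZ, while active-active should alternate between D and E.
        answers = set()
        for _ in range(4):
            for rr in resolver.resolve(name, "A"):
                answers.add(rr.address)
        print(name, sorted(answers))

Against a healthy environment you'd expect a single address for each prefer-* name and both AZ addresses for active-active.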

Here’s what things look like for an external user when everything is healthy:

Or the same view from the command line with curl:
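
In script form, the curl loop boils down to something like this; it assumes your client resolves through the GTM as described and that each demo web server identifies itself in its response body:

    # Hypothetical stand-in for the curl loop. Assumes the client's resolver
    # points at the GTM listener and the backends return an identifying body.
    import urllib.request

    for _ in range(6):
        with urllib.request.urlopen("http://active-active.f5demo.com/", timeout=3) as resp:
            print(resp.read().decode().strip())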

Taking a look from an internal user, we can see that the behavior is slightly different. In this case a request from the US-EAST-1D AZ is always directed to the same AZ when communicating with the active-active service, and it reaches the service via the internal VPC IP address rather than the EIP. For very data-heavy services this can provide cost savings by avoiding cross-AZ data transfer charges.
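
A simple way to confirm the split DNS behavior is to compare the answer an internal client gets from the VPC listener with the answer an external resolver hands out; both listener addresses below are placeholders:

    # Split DNS check (assumes dnspython). Addresses are placeholders:
    # INTERNAL_LISTENER is the DNS cache listener inside the VPC,
    # EXTERNAL_RESOLVER is the externally reachable GTM listener.
    import dns.resolver

    def lookup(name, nameserver):
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [nameserver]
        return [rr.address for rr in r.resolve(name, "A")]

    INTERNAL_LISTENER = "10.0.1.10"
    EXTERNAL_RESOLVER = "203.0.113.10"

    name = "active-active.f5demo.com"
    print("internal view:", lookup(name, INTERNAL_LISTENER))  # expect internal VPC IP, same AZ
    print("external view:", lookup(name, EXTERNAL_RESOLVER))  # expect EIPs, round robin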

Triggering a failure of the D services (stopping the web server), we can see that connections initially fail while the client is still trying to reach the service at the US-EAST-1D IP address:

Once the client refreshes its DNS record (30-second default TTL), we can see that it is now communicating only with US-EAST-1E services:
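
You can reproduce that client-side behavior with a small loop that caches the answer for its TTL the way a stub resolver would; a sketch, again assuming dnspython and a placeholder listener address:

    # Failover watch: re-resolve only when the cached answer's TTL expires,
    # mimicking a client's stub resolver, and probe the service over HTTP.
    import time
    import urllib.request
    import dns.resolver

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["203.0.113.10"]  # placeholder GTM listener

    cached_ip, expires = None, 0.0
    for _ in range(12):
        now = time.time()
        if now >= expires:  # honor the record's TTL (30 seconds in the demo)
            answer = resolver.resolve("prefer-d.f5demo.com", "A")
            cached_ip, expires = answer[0].address, now + answer.rrset.ttl
        try:
            urllib.request.urlopen("http://%s/" % cached_ip, timeout=2)
            status = "OK"
        except OSError as exc:
            status = "FAILED (%s)" % exc
        print(time.strftime("%H:%M:%S"), cached_ip, status)
        time.sleep(5)

Kill the D web server partway through and you should see FAILED lines until the TTL expires, after which the US-EAST-1E address takes over.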

Setting it up

To create this demo you’ll need:

  • LTM/GTM devices
  • Two AZs
  • Some backend services

Once you have these, you can build out a standard LTM/GTM environment in AWS. Create a DNS cache (required for topology LB). Point your AWS instances at your DNS cache listeners (make sure these are accessible only to your internal clients!). Then build up your split DNS and topology records.
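
As a sanity check on that "internal clients only" warning, a query like the one below, run from both an internal instance and an external host, should only succeed from inside the VPC (the listener address is a placeholder):

    # Listener reachability check (assumes dnspython). Run from inside and
    # outside the VPC; only the internal run should print an answer.
    import dns.exception
    import dns.resolver

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["10.0.1.10"]  # placeholder DNS cache listener
    resolver.lifetime = 3  # give up quickly if the listener is unreachable

    try:
        answer = resolver.resolve("active-active.f5demo.com", "A")
        print("listener reachable:", [rr.address for rr in answer])
    except (dns.exception.Timeout, dns.resolver.NoNameservers):
        print("listener unreachable from here (expected for external hosts)")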

You can extend this example to go cross-region as well (with the limitation that your internal IP space would not be reachable across VPCs)!

More Chaos

The example above illustrates a single failure scenario (loss of US-EAST-1D web services). You can imagine several other scenarios that could cause a failure, including but not limited to:

  • Loss of US-EAST-1E services
  • Loss of a single AZ
  • Loss of a single LTM/GTM device

The demo environment is built to survive these and keep on running despite the best efforts of any chaos monkeys.

Published May 22, 2015