Working in the AWS public cloud, one has to adapt to a world of guaranteed failure at unpredictable times. Combining LTM for HA within a single Availability Zone with GTM across Availability Zones and Regions provides an architecture that can survive the chaos monkeys.
The following is part of a demo environment that I built for AWS; it highlights a couple of useful features of LTM/GTM, including:
Here’s what the overall architecture looks like:
We create a couple of wide IP records to show different failure scenarios (these DNS records are isolated to my demo environment):
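As a rough idea of what those records look like, here is a minimal tmsh sketch. The pool, server, virtual server, and hostnames (pool_us_east_1d, bigip_east_1d, vs_web, demo.example.com, and so on) are placeholders for my demo environment, and the exact syntax varies slightly between TMOS versions.

```
# GTM pools, one per availability zone; each member is a server:virtual-server
# pair already defined in the GTM configuration (names here are hypothetical).
tmsh create gtm pool a pool_us_east_1d members add { bigip_east_1d:vs_web }
tmsh create gtm pool a pool_us_east_1e members add { bigip_east_1e:vs_web }

# Active-active Wide IP: both AZ pools hand out answers in round-robin fashion.
tmsh create gtm wideip a active.demo.example.com \
    pool-lb-mode round-robin \
    pools add { pool_us_east_1d pool_us_east_1e }

# Active-standby Wide IP: global availability prefers the first pool and only
# answers from the second when the first is marked down.
tmsh create gtm wideip a failover.demo.example.com \
    pool-lb-mode global-availability \
    pools add { pool_us_east_1d { order 0 } pool_us_east_1e { order 1 } }
```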
Here’s what things look like from an external user’s perspective when everything is healthy:
Or here’s another view from the command line using curl:
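If you want to reproduce that view yourself, a small loop from any shell works; the hostname is a placeholder, and I’m assuming the demo web page reports which AZ served the request.

```
# Fetch the test page a few times to see which availability zone answers.
for i in $(seq 1 5); do
    curl -s http://active.demo.example.com/ | grep -i 'availability zone'
    sleep 1
done
```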
Taking a look from an internal user, we can see that the behavior is slightly different. In this case, a request from the US-EAST-1D AZ will always be served from the same AZ when communicating with the active-active service, and it reaches the service using the internal IP address. Keeping data-heavy traffic within an AZ like this avoids cross-AZ data transfer billing charges.
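That same-AZ preference for internal clients comes from GTM topology records. A hedged sketch of what they might look like follows; the subnets, pool names, and scores are assumptions about my demo VPC, and the internal view of the Wide IP would use topology load balancing so the scores take effect.

```
# LDNS selectors match the source of the DNS query (here, the per-AZ client subnets);
# clients in the 1D subnet score the 1D pool highest, and likewise for 1E.
tmsh create gtm topology ldns: subnet 10.0.1.0/24 server: pool pool_us_east_1d score 100
tmsh create gtm topology ldns: subnet 10.0.2.0/24 server: pool pool_us_east_1e score 100

# Internal (split DNS) view of the Wide IP uses topology load balancing.
tmsh modify gtm wideip a active.demo.example.com pool-lb-mode topology
```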
Simulating a failure of the US-EAST-1D services (stopping the web server), we can see that connections initially fail while the client is still trying to reach the service using the US-EAST-1D IP address:
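To reproduce this, I simply stop the web server on the US-EAST-1D instance and watch from the client; the service name and hostname below are assumptions about the demo instances.

```
# On the US-EAST-1D web instance: stop the web server so the health monitors mark it down.
sudo service httpd stop

# On the client: watch requests fail until the cached 1D answer expires.
while true; do
    curl -s --max-time 2 http://active.demo.example.com/ || echo "request failed at $(date +%T)"
    sleep 2
done
```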
Once the client refreshes its DNS record (default 30-second TTL), we can see that it is now only communicating with US-EAST-1E services:
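You can watch the cached answer and its TTL count down with dig pointed at the internal listener; the listener address and hostname here are placeholders.

```
# Query the internal DNS cache listener repeatedly; once the 30-second TTL expires,
# the answer flips from the US-EAST-1D address to the US-EAST-1E address.
watch -n 5 "dig +noall +answer @10.0.1.245 active.demo.example.com"
```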
To create this demo you’ll need:
Once you have these, you can build out a standard LTM/GTM environment in AWS. Create a DNS cache (required for topology load balancing). Point your AWS instances to your DNS cache listeners (make sure these are only accessible to your internal clients!). Build up some Split DNS / topology records.
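Here’s a rough tmsh sketch of those steps. The object names, listener address, and profile options are assumptions from my demo, and the exact attributes differ between TMOS versions.

```
# Transparent DNS cache (required for topology load balancing of internal queries).
tmsh create ltm dns cache transparent internal_cache

# DNS profile that enables the cache.
tmsh create ltm profile dns dns_internal cache internal_cache enable-dns-cache yes

# GTM listener for internal clients only -- lock it down to your VPC address space.
tmsh create gtm listener internal_listener address 10.0.1.245 port 53 \
    ip-protocol udp profiles add { dns_internal }
```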
You can also extend this example to go cross-Region (with the limitation that your internal IP space would not be accessible across VPCs)!
The example above illustrates a single failure scenario (loss of the US-EAST-1D web services). You can imagine several other scenarios that could cause a failure, including but not limited to:
The demo environment is built to survive these and keep on running despite the best efforts of any chaos monkeys.
Also take a look at https://devcentral.f5.com/s/articles/aws-advanced-ha-iapp for another way to perform cross-AZ failover.