Jeff_Giroux, F5 Employee

Introduction

Business and Application Owners want, demand, and should have...

Uptime! Resiliency! Scalability! Availability!

This article will provide information about BIG-IP and NGINX high availability (HA) topics that should be considered when leveraging the public cloud. There are differences between on-prem and public cloud such as modified or unique cloud provider L2 networking. These differences lead to challenges in how you address HA, failover time, peer setup, scaling options, and application state. I'll call out key items as well as provide recommendations based on my experience working with customers and testing in my lab.

Topics Covered:

  • Discuss and Define HA
  • Importance of Application Behavior and Traffic Sizing
  • HA Capabilities of BIG-IP and NGINX
  • Various HA Deployment Options (Active/Active, Active/Standby, auto scale)
  • Example Customer Scenario


Note: Each cloud provider handles certain features differently than the others. Where there are unique differences between cloud providers, I'll include relevant links and/or notes pointing to that specific provider to boost your knowledge!

What is High Availability?

High availability can mean many things to different people. Depending on the application and traffic requirements, HA calls for dual data paths, redundant storage, and redundant compute. Oh, and don't forget redundant power supplies for on-prem gear. HA as I'm defining it here means (among other things) the ability to survive a failure, maintenance windows that are seamless to the user, and a user experience that never suffers...ever! I like to think of HA as providing that "always on" experience similar to electricity. When I flip on a light switch, electricity just...simply...works. Similarly, when I visit my favorite web site or play my favorite online game, I expect it to just...simply...work.

Reference: https://en.wikipedia.org/wiki/High_availability

So what should HA provide?

  1. Synchronization of configuration data to peers (ex. config objects)
  2. Synchronization of application session state (ex. persistence records)
  3. Enable traffic to fail over to a peer
  4. Locally, allow clusters of devices to act and appear as one unit
  5. Globally, distribute traffic via DNS and routing

Importance of Application Behavior and Traffic Sizing

I was in a customer meeting recently, and the discussion was focused specifically on session state and sizing. The use case was common...

"gaming app, lots of persistent connections, client needs to hit same backend throughout entire game session"

Session State

Session state requirements are common across many applications, and the state can be tracked with methods like HTTP cookies, custom F5 iRule persistence, JSESSIONID, IP affinity, hash, and more. The type of session state used by the application can help you decide what migration type is right for that app. Is this an app more fitting for a lift-n-shift approach...Rehost? Is this an app that can be re-coded a bit to take advantage of some cloud-native services...Replatform? Can the app be TOTALLY redesigned to take advantage of all native IaaS and PaaS technologies...Refactor? If you are not aware of the 6 R's I'm referencing here, please check out the reference below.

Reference: 6 R's of a Cloud Migration

  1. Application session state allows the user to have a consistent and reliable experience
  2. Auto scaling L7 proxies (BIG-IP or NGINX) keep track of session state
  3. Example = e-commerce sites, gaming servers, banking transactions
  4. **BIG-IP can only mirror session state to next device in cluster
  5. **NGINX can mirror state to all devices in cluster (via zone sync)

The latter two items will be discussed in more detail in upcoming sections.
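
To make the difference concrete before we get there, here is a minimal, illustrative sketch (plain Python, not product code) of why persistence records have to be mirrored at all. The backend IPs and session ID are made up.

```python
# Illustrative sketch only (plain Python, not product code): why persistence
# records have to be mirrored between peers. Backend IPs and the session ID
# are made up.
import random

BACKENDS = ["10.0.2.10", "10.0.2.11", "10.0.2.12"]   # hypothetical pool members

class ProxyInstance:
    """One L7 proxy (think BIG-IP or NGINX Plus) with a local persistence table."""
    def __init__(self):
        self.persistence = {}                         # session_id -> backend address

    def pick_backend(self, session_id):
        if session_id not in self.persistence:        # new session: load balance
            self.persistence[session_id] = random.choice(BACKENDS)
        return self.persistence[session_id]           # known session: stay sticky

active, peer = ProxyInstance(), ProxyInstance()
sid = "game-session-42"
chosen = active.pick_backend(sid)

# Stand-in for state mirroring: BIG-IP mirrors to the *next* device in the
# cluster, NGINX Plus zone sync replicates to *all* peers. Without this copy,
# the peer would re-load-balance the session after a failover.
peer.persistence.update(active.persistence)
assert peer.pick_backend(sid) == chosen
```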

Traffic Sizing

As for traffic sizing, well...you would hope the cloud takes care of most of that. It does a great job with things like auto scale, but there are still cloud provider limits that affect sizing and machine instance types to keep in mind. Since I'm talking specifically about BIG-IP and NGINX products, those are considered network virtual appliances (NVA)...or in other words just another compute instance.

  1. Cloud providers each have their own limits
  2. Example = networking limits, VM limits, number of flows
  3. Google GCP VPC Resource Limits
  4. Azure VM Flow Limits
  5. AWS Instance Types
  6. What if your current usage shows 4 million concurrent connections?
  7. What if your current usage shows 30,000 new SSL TPS?
  8. What if your current usage shows 50Gbps of throughput?

The latter three items can lead to very different VM counts and sizes depending on cloud provider limits. Unfortunately, not all limits are documented. Key metrics for L7 proxies are typically SSL stats, throughput, connection type, and connection count. Collecting requirements around application behavior and traffic sizing can help you choose the instance size as well as instance count. We have a list of the F5 supported BIG-IP VE platforms on F5 CloudDocs.
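
To make that sizing exercise concrete, here is a quick back-of-the-envelope sketch. The per-instance capacities are placeholder assumptions; pull real numbers from your cloud provider's limit pages and the F5 BIG-IP VE / NGINX performance data for your chosen instance type.

```python
# Back-of-the-envelope sizing sketch. The per-instance capacities below are
# placeholder assumptions -- substitute real numbers from your cloud provider's
# limit pages and the F5/NGINX performance data for your instance type.
import math

observed_peak = {              # from your current usage stats
    "concurrent_connections": 4_000_000,
    "ssl_tps": 30_000,
    "throughput_gbps": 50,
}

per_instance_capacity = {      # hypothetical numbers for one VE on one VM size
    "concurrent_connections": 1_000_000,
    "ssl_tps": 8_000,
    "throughput_gbps": 10,
}

headroom = 1.25                # keep ~25% spare for failover and growth

instances_needed = max(
    math.ceil(observed_peak[k] * headroom / per_instance_capacity[k])
    for k in observed_peak
)
print(f"Plan for at least {instances_needed} instances (N+1 adds one more).")
```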

Next, we'll dive into the HA capabilities of the various F5 products and ways to deploy.

F5 Products and HA Capabilities

This section will cover the BIG-IP and NGINX capabilities for HA.

BIG-IP HA Capabilities

BIG-IP supports the following HA cluster configurations:

  1. Active/Active - all devices processing traffic
  2. Active/Standby - one device processes traffic, others wait in standby
  3. Configuration sync to all devices in cluster
  4. **Mirroring connections ONLY to next device in cluster
  5. L3/L4 connection info prevents service interruption upon failover (ex. avoids re-login)
  6. **Mirroring session state ONLY to next device in cluster
  7. L5-L7 state info to reach same backend server (ex. IP persistence, SSL persistence, iRule UIE persistence)

Reference: BIG-IP High Availability Docs

NGINX HA Capabilities

NGINX supports the following HA cluster configurations:

  1. Active/Active - all devices processing traffic
  2. Active/Standby - one device processes traffic, others wait in standby
  3. Configuration sync to all devices in cluster
  4. Mirroring connections at L3/L4 not available
  5. Mirroring session state to ALL devices in cluster using Zone Synchronization Module (NGINX Plus R15)
  6. Limited to sticky-learn session persistence, rate limiting info, and key-value stores

Reference: NGINX High Availability Docs

HA Methods for BIG-IP

In the following sections, I will illustrate 5 common deployment configurations for BIG-IP in public cloud.

  • HA for BIG-IP Design #1 - Active/Standby via API (single AZ)
  • HA for BIG-IP Design #2 - Active/Standby via API (multi AZ)
  • HA for BIG-IP Design #3 - A/A or A/S via LB (multi AZ)
  • HA for BIG-IP Design #4 - Auto Scale Active/Active via LB (multi AZ)
  • HA for BIG-IP Design #5 - Auto Scale Active/Active via DNS (multi AZ)

HA for BIG-IP Design #1 - Active/Standby via API (single AZ)

[diagram: Active/Standby via API, single AZ]

  • Cloud provider load balancer is NOT required
  • Failover time can be SLOW!
  • Only one device actively used (other device sits idle)
  • Use of single AZ deployments not recommended by cloud providers
  • Failover uses API calls to move cloud objects; times vary
    • AWS = 5-15 sec observed in testing, longer depending on other API activity
    • Google = 5-15 sec observed in testing, also depends on # of objects (ex. 200 sec for many forwarding rules)
    • Azure = 30-90 sec observed in testing, and I have seen 10 min or more

When failover methods use API calls, the results depend on how quickly the cloud provider processes the request and in what fashion (bulk vs. sequential). We use the F5 Cloud Failover Extension (CFE) for BIG-IP failover with the API method. I suggest you head over to the CFE page and take a look!
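
Purely to make the mechanism concrete, here is roughly what one API-driven failover step looks like on AWS: re-associating the Elastic IP that fronts the VIP onto the new active device's NIC. In practice CFE performs this for you; the IDs below are placeholders.

```python
# Rough illustration of "failover via API" on AWS: re-associate the Elastic IP
# serving the VIP onto the newly active device's NIC. In practice the F5 Cloud
# Failover Extension performs this; the IDs below are placeholders.
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

start = time.monotonic()
ec2.associate_address(
    AllocationId="eipalloc-0123456789abcdef0",    # Elastic IP serving the VIP
    NetworkInterfaceId="eni-0fedcba9876543210",   # NIC on the newly active BIG-IP
    AllowReassociation=True,                      # take it over from the old active
)
print(f"EIP re-association call returned in {time.monotonic() - start:.1f}s")
# The API call returning is not the whole story -- traffic only shifts once the
# provider finishes moving the mapping, which is why observed times vary.
```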

Key Findings:

  1. Google API failover times depend on number of forwarding rules
  2. Azure API extremely slow to disassociate/associate IPs to NICs (remapping)
  3. Azure API fast when updating routes (UDR, user defined routes)
  4. AWS seems reliable with API regarding IP moves and routes

Recommendations:

  1. We generally guide customers to not use a single AZ design for production
  2. Single AZ is useful for a single device, testing/dev
  3. Highly recommend considering a multi-AZ design (see next)

HA for BIG-IP Design #2 - Active/Standby via API (multi AZ)

[diagram: Active/Standby via API, multi AZ]

  • Similar pros/cons to HA in single AZ
  • Failover method uses API
  • Failover times can be slow
  • No cloud provider LB required

Key Difference:

  1. Multiple AZs are used
  2. Increased network/compute redundancy

Recommendations:

  1. This multi-AZ design is preferred over the single AZ design
  2. Recommend when a "traditional" HA cluster is required or for Lift-n-Shift...Rehost
  3. For Azure (based on my testing)...
    • Recommend against API-based IP failover due to long failover times
    • Recommend using Azure UDR (user defined routes) instead of IP failover when possible
    • Look at the failover via LB example instead for Azure
    • If the API method is required, look at DNS solutions to provide further redundancy
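
To make the Azure UDR recommendation concrete, here is a minimal sketch (assumed azure-mgmt-network SDK usage, with placeholder resource names) of pointing a route table's default route at the self IP of the newly active BIG-IP rather than re-mapping IPs between NICs. Again, CFE normally performs this step for you.

```python
# Sketch of the Azure UDR approach (assumed azure-mgmt-network SDK usage,
# placeholder resource names): point the route table's default route at the
# self IP of the newly active BIG-IP instead of re-mapping IPs between NICs.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = network.routes.begin_create_or_update(
    resource_group_name="rg-bigip-ha",
    route_table_name="rt-app-subnet",
    route_name="default-via-bigip",
    route_parameters={
        "address_prefix": "0.0.0.0/0",
        "next_hop_type": "VirtualAppliance",
        "next_hop_ip_address": "10.0.1.11",   # self IP of the newly active BIG-IP
    },
)
poller.result()   # route updates complete much faster than NIC IP re-mapping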

HA for BIG-IP Design #3 - A/A or A/S via LB (multi AZ)

[diagram: A/A or A/S via cloud LB, multi AZ]

  • Cloud LB health checks the BIG-IP for up/down status
  • Faster failover times (depends on cloud LB health timers)
  • Cloud LB allows A/A or A/S

Key difference:

  1. Multiple AZs are used
  2. Increased network/compute redundancy
  3. Cloud load balancer required

Recommendations:

  1. This multi-AZ design is preferred over the single AZ design
  2. Use "failover via LB" or auto scale if you require faster failover times
  3. This applies to all cloud providers
  4. For Google (based on my testing)...
    • Recommend against "via LB" for IPsec traffic (not supported by Google LB)
    • If load balancing IPsec, use the "via API" or "via DNS" failover methods
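
A quick rule-of-thumb sketch of why this method is faster: with "failover via LB" the detection time is governed by the LB health probe settings rather than API latency. The numbers below are examples only; each provider has its own allowed ranges and defaults.

```python
# Rule-of-thumb sketch: with "failover via LB", detection time is governed by
# the cloud LB health probe settings, not by API call latency. The numbers are
# examples only; each provider has its own allowed ranges and defaults.
probe_interval_s = 10        # how often the LB probes each BIG-IP instance
unhealthy_threshold = 3      # consecutive failed probes before marking it down

worst_case_detection_s = probe_interval_s * unhealthy_threshold
print(f"Traffic shifts to a healthy peer within roughly {worst_case_detection_s}s")
# Compare that with the 30-90s (or worse) seen with API-driven failover on some
# providers -- which is why the LB method wins when failover speed matters.
```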

HA for BIG-IP Design #4 - Auto Scale Active/Active via LB (multi AZ)

[diagram: Auto Scale Active/Active via cloud LB, multi AZ]

  • BIG-IP VE active/active and scales in/out
  • Traffic distributed to VEs by cloud LB
  • Cloud LB health checks the VEs
  • BIG-IP instances will scale in/out depending on thresholds
    • AWS = Auto Scaling groups
    • Azure = VM Scale Sets
    • Google = Target Instances and Target Pools
  • BIG-IP can auto heal (pave and nuke)
    • Cloud provider will launch a new instance when overall health fails

Key difference:

  1. Multiple AZs are used
  2. Cloud LB required (extra cost, extra hop)
  3. BIG-IP devices auto heal
  4. Rolling upgrades to BIG-IP
  5. BIG-IP devices launched in "single-NIC" mode, throughput max 1.5 Gbps

Recommendations:

  1. Auto scale is great when traffic stats are unknown or seasonal (shrink/grow)
  2. Recommended for apps fitting a migration type of Replatform or Refactor
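
To illustrate the "scale in/out depending on thresholds" idea, here is an AWS-flavored sketch using a target tracking policy. The F5 auto scale templates configure this for you; the group and policy names are placeholders, and Azure (VM Scale Sets) and Google have equivalent constructs.

```python
# AWS-flavored sketch of "scale in/out depending on thresholds" using a target
# tracking policy. The F5 auto scale templates set this up for you; group and
# policy names are placeholders, and Azure/Google have equivalent features.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="bigip-autoscale-group",
    PolicyName="scale-on-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,   # add VEs above ~60% average CPU, remove below it
    },
)
# Health-check failures lead to instance replacement by the scaling group --
# the "auto heal / pave and nuke" behavior described above.
```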

HA for BIG-IP Design #5 - Auto Scale Active/Active via DNS (multi AZ)

[diagram: Auto Scale Active/Active via DNS, multi AZ]

  • BIG-IP VE active/active and scales in/out
  • Traffic distributed to VEs by BIG-IP DNS (aka GTM)
  • BIG-IP DNS health checks the VEs
  • BIG-IP instances will scale in/out depending on thresholds
  • BIG-IP can auto heal (pave and nuke)

Key difference:

  1. Multiple AZs are used
  2. Cloud LB not required
  3. BIG-IP devices auto heal
  4. Rolling upgrades to BIG-IP
  5. BIG-IP devices launched in "single-NIC" mode, throughput max 1.5 Gbps
  6. DNS logic required by clients
  7. Each VIP on each BIG-IP will be an A record in the GTM pool

Recommendations:

  1. Good for apps that handle DNS resolution well upon failover events
  2. Recommend when cloud LB cannot handle particular protocol
  3. Recommend when customer is already using BIG-IP DNS (aka GTM)
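
To illustrate the "DNS logic required by clients" note above: each VIP is an A record behind the GTM wide-IP, so a well-behaved client tries the other answers (and re-resolves, honoring TTL) when the first one fails. A small client-side sketch, with a hypothetical hostname:

```python
# Client-side sketch of the "DNS logic required by clients" point: each VIP is
# an A record behind the GTM wide-IP, so the client should try the remaining
# answers (and re-resolve, honoring TTL) when the first one fails.
import socket

def connect_any(hostname, port=443, timeout=3.0):
    """Try every address returned for the name until one accepts a TCP connection."""
    last_error = None
    for *_, sockaddr in socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM):
        try:
            return socket.create_connection(sockaddr[:2], timeout=timeout)
        except OSError as exc:          # that VIP is down/unreachable; try the next
            last_error = exc
    raise ConnectionError(f"no reachable address for {hostname}") from last_error

# conn = connect_any("app.example.com")   # hypothetical GTM wide-IP
```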

HA Methods for NGINX

In the following sections, I will illustrate 2 common deployment configurations for NGINX in public cloud.

  • HA for NGINX Design #1 - Active/Standby via API (multi AZ)
  • HA for NGINX Design #2 - Auto Scale Active/Active via LB (multi AZ)

HA for NGINX Design #1 - Active/Standby via API (multi AZ)

[diagram: NGINX Active/Standby via API, multi AZ]

  • NGINX Plus required
  • Cloud provider load balancer is NOT required
  • Only one device actively used (other device sits idle)
  • Failover times dependent on cloud provider
    • AWS = 5-15 seconds observed
  • A/S via API not available in Google or Azure

Recommendations:

  1. Great when auto scale not needed
  2. Recommend when "traditional" HA cluster required or Lift-n-Shift...Rehost

Reference: Active-Passive HA for NGINX Plus on AWS

HA for NGINX Design #2 - Auto Scale Active/Active via LB (multi AZ)

[diagram: NGINX Auto Scale Active/Active via cloud LB, multi AZ]

  • NGINX Plus required
  • Cloud LB health checks the NGINX
  • Faster failover times

Key difference:

  1. Multiple AZs are used
  2. Increased network/compute redundancy
  3. Cloud load balancer required

Recommendations:

  1. Auto scale is great when traffic stats are unknown or seasonal (shrink/grow)
  2. Recommended for apps fitting a migration type of Replatform or Refactor

Reference: Active-Active HA for NGINX Plus on AWS, Active-Active HA for NGINX Plus on Google

Example Customer Scenario #1

As a means to make this topic a little more real, here is a common customer scenario that shows you the decisions that go into moving an application to the public cloud. Sometimes it's as easy as a lift-n-shift; other times you might need to do a little more work. In general, public cloud is not on-prem, and things might need some tweaking. Hopefully this example will give you some pointers and guidance on your next app migration to the cloud.

Current Setup:

  • Gaming applications
  • F5 hardware BIG-IP VIPRIONs on-prem
  • Two data centers for HA redundancy
  • iRule heavy configuration (TLS encryption/decryption, payload inspections)
  • Session Persistence = iRule Universal Persistence (UIE), and other methods
  • Biggest app:
    • 15K SSL TPS
    • 15Gbps throughput
    • 2 million concurrent connections
    • 300K HTTP req/sec (L7 with TLS)

Requirements for Successful Cloud Migration:

  1. Support current traffic numbers
  2. Support future target traffic growth
  3. Must run in multiple geographic regions
  4. Maintain session state
  5. Must retain all iRules in use

Recommended Design for Cloud Phase #1:

  • Migration Type: Hybrid model, on-prem + cloud, and some Rehost
  • Platform: BIG-IP
    • Retaining iRules means BIG-IP is required
  • Licensing: High Performance BIG-IP
    • Unlocks additional CPU cores past 8 (up to 24) for extra traffic and SSL processing
  • Instance type: check the F5 supported BIG-IP VE platforms for accelerated networking (10Gb+)
  • HA method: Active/Standby and multi-region with DNS
    • iRule Universal persistence mirrors only to the next device, so keep cluster size to 2
    • Scale horizontally via additional HA clusters and DNS
    • Clients pinned to a region via DNS (on-prem or public cloud)
    • Inside a region, the local proxy cluster shares state

This example comes up in customer conversations often. Based on customer requirements, in-house skillset, current operational model, and time frames, one option will stand out as better than the rest. A second design phase lends itself to more of a Replatform or Refactor migration type. In that case, more options can be leveraged to take advantage of cloud-native features. For example, changing the application persistence type from iRule UIE to cookie would allow BIG-IP to avoid keeping track of state. Why? With cookies, the client keeps track of that session state. The client receives a cookie, passes it back to the L7 proxy on subsequent requests, the proxy checks the cookie value, and traffic goes to the matching backend pool member. The requirement for the L7 proxy to share session state is now removed.
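
Here is a minimal sketch of that cookie idea (illustrative only; the cookie name, backend names, and addresses are made up and not a specific BIG-IP or NGINX cookie format): the client carries the routing hint, so no proxy in the pool needs mirrored state.

```python
# Minimal sketch of the phase-two idea: with cookie persistence the client
# carries the routing hint, so no proxy in the pool needs mirrored state.
# Cookie name, backend names, and addresses are illustrative only.
import random

BACKENDS = {"srv1": "10.0.2.10", "srv2": "10.0.2.11", "srv3": "10.0.2.12"}
COOKIE = "backend_id"

def handle_request(cookies):
    """Any proxy instance can serve this request; the 'state' lives in the cookie."""
    backend_id = cookies.get(COOKIE)
    if backend_id not in BACKENDS:                    # first visit (or stale cookie)
        backend_id = random.choice(list(BACKENDS))    # pick a pool member
    # ...forward the request to BACKENDS[backend_id] here...
    return BACKENDS[backend_id], {COOKIE: backend_id} # set/refresh the cookie

# The first request can land on any proxy; every later request that presents
# the cookie is pinned to the same backend, no matter which proxy receives it.
backend, cookies = handle_request({})
assert handle_request(cookies)[0] == backend
```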

Example Customer Scenario #2

Here is another customer scenario. This time the application is a full suite of multimedia content. In contrast to the first scenario, this one illustrates the benefits of rearchitecting various components, allowing greater flexibility when leveraging the cloud. You still must factor in in-house skill set, project time frames, and other important business (and application) requirements when deciding on the best migration type.

Current Setup:

  • Multimedia (Gaming, Movie, TV, Music) Platform
  • BIG-IP VIPRIONs using vCMP on-prem
  • Two data centers for HA redundancy
  • iRule heavy (Security, Traffic Manipulation, Performance)
  • Biggest App: OAuth + Cassandra for token storage (entitlements)

Requirements for Successful Cloud Migration:

  1. Support current traffic numbers
  2. Elastic auto scale for seasonal growth (ex. holidays)
  3. VPC peering with partners (must also bypass Web Application Firewall)
  4. Must support current or similar traffic manipulation in the data plane
  5. Compatibility with existing tooling used by Business

Recommended Design for Cloud Phase #1:

  • Migration Type: Repurchase, migrating from BIG-IP to NGINX Plus
  • Platform: NGINX
    • iRules converted to JavaScript or Lua
  • Licensing: NGINX Plus
    • Modules: GeoIP, Lua, JavaScript
  • HA method: N+1
    • Auto scaling via native LB
    • Active health checks

This is a great example of a Repurchase, in which the application's characteristics allow the various teams to explore alternative cloud migration approaches. This scenario describes a phase-one migration of converting BIG-IP devices to NGINX Plus devices. This example assumes the BIG-IP configurations can be fairly easily converted to NGINX Plus, and it also assumes there is available skillset and project time allocated to properly rearchitect the application where needed.

Summary

OK! Brains are expanding...hopefully? We learned about high availability and what that means for applications and the user experience. We touched on the importance of application behavior and traffic sizing. Then we explored the various F5 products, how they handle HA, HA designs, and my favorite...my own personal recommendations. These are of course my own recommendations, not official F5 recommendations; they are based on my own lab testing and interactions with customers. Every scenario will carry its own requirements, and all options should be carefully considered when leveraging the public cloud. Finally, we looked at customer scenarios, discussed requirements, and walked through design proposals. Fun!

Appendix

Read the following articles for more guidance specific to the various cloud providers. The information provided earlier is meant to be more general across all clouds.

AWS and BIG-IP: Advanced Topologies and More on Highly Available Services

Azure and BIG-IP: Lightboard Lessons - BIG-IP Deployments in Azure

Google and BIG-IP: Failing Faster in the Cloud

F5 CloudDocs: BIG-IP VE on Public Cloud

Google and NGINX Plus: High-Availability Load Balancing with NGINX Plus on Google Cloud Platform

AWS and NGINX Plus: Using AWS Quick Starts to Deploy NGINX Plus

Azure and NGINX Plus: NGINX on Azure

Comments
Jeff_Giroux, F5 Employee

This was a fun article. Reach out with any questions.

Ted_Byerly, F5 Employee

Great write up. Articles like this are great to explain all the options and design considerations.
