Jeff_Giroux, F5 Employee

Introduction

Business and Application Owners want, demand, and should have...

Uptime! Resiliency! Scalability! Availability!

This article will provide information about BIG-IP and NGINX high availability (HA) topics that should be considered when leveraging the public cloud. There are differences between on-prem and public cloud such as modified or unique cloud provider L2 networking. These differences lead to challenges in how you address HA, failover time, peer setup, scaling options, and application state. I'll call out key items as well as provide recommendations based on my experience working with customers and testing in my lab.

Topics Covered:

  • Discuss and Define HA
  • Importance of Application Behavior and Traffic Sizing
  • HA Capabilities of BIG-IP and NGINX
  • Various HA Deployment Options (Active/Active, Active/Standby, auto scale)
  • Example Customer Scenario


Note: Each cloud provider handles certain features differently than the others. Where there are unique differences between cloud providers, I'll include relevant links and/or notes pointing to that specific provider to boost your knowledge!

What is High Availability?

High availability can mean many things to different people. Depending on the application and traffic requirements, HA calls for dual data paths, redundant storage, and redundant compute. Oh, and don't forget redundant power supplies for on-prem gear. HA as I'm defining it here means (among other things) the ability to survive a failure, maintenance windows that are seamless to the user, and a user experience that never suffers...ever! I like to think of HA as providing that "always on" experience similar to electricity. When I flip on a light switch, electricity just...simply...works. Similarly, when I visit my favorite web site or play my favorite online game, I expect it to just...simply...work.

Reference: https://en.wikipedia.org/wiki/High_availability

So what should HA provide?

  1. Synchronization of configuration data to peers (ex. config objects)
  2. Synchronization of application session state (ex. persistence records)
  3. Enable traffic to fail over to a peer
  4. Locally, allow clusters of devices to act and appear as one unit
  5. Globally, distribute traffic via DNS and routing

Importance of Application Behavior and Traffic Sizing

I was in a customer meeting recently, and the discussion was focused specifically on session state and sizing. The use case was common...

"gaming app, lots of persistent connections, client needs to hit same backend throughout entire game session"

Session State

Session state requirements are common across many applications, and the state can be tracked with methods like HTTP cookies, custom F5 iRule persistence, JSESSIONID, IP affinity, hash, and more. The type of session state used by the application can help you decide what migration type is right for that app. Is this an app more fitting for a lift-n-shift approach...Rehost? Is this an app that can be re-coded a bit to take advantage of some cloud-native services...Replatform? Can the app be TOTALLY redesigned to take advantage of all native IaaS and PaaS technologies...Refactor? If you are not aware of the 6 R's I'm referencing here, please check out the reference below.

Reference: 6 R's of a Cloud Migration

  1. Application session state allows the user to have a consistent and reliable experience
  2. Auto scaling L7 proxies (BIG-IP or NGINX) keep track of session state
  3. Example = e-commerce sites, gaming servers, banking transactions
  4. **BIG-IP can only mirror session state to next device in cluster
  5. **NGINX can mirror state to all devices in cluster (via zone sync)

The latter two items will be discussed in more detail in upcoming sections.
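
To make the difference concrete before we get there, here is a minimal, illustrative sketch (plain Python, not product code) of why persistence records have to be mirrored at all. The backend IPs and session ID are made up.

```python
# Illustrative sketch only (plain Python, not product code): why persistence
# records have to be mirrored between peers. Backend IPs and the session ID
# are made up.
import random

BACKENDS = ["10.0.2.10", "10.0.2.11", "10.0.2.12"]   # hypothetical pool members

class ProxyInstance:
    """One L7 proxy (think BIG-IP or NGINX Plus) with a local persistence table."""
    def __init__(self):
        self.persistence = {}                         # session_id -> backend address

    def pick_backend(self, session_id):
        if session_id not in self.persistence:        # new session: load balance
            self.persistence[session_id] = random.choice(BACKENDS)
        return self.persistence[session_id]           # known session: stay sticky

active, peer = ProxyInstance(), ProxyInstance()
sid = "game-session-42"
chosen = active.pick_backend(sid)

# Stand-in for state mirroring: BIG-IP mirrors to the *next* device in the
# cluster, NGINX Plus zone sync replicates to *all* peers. Without this copy,
# the peer would re-load-balance the session after a failover.
peer.persistence.update(active.persistence)
assert peer.pick_backend(sid) == chosen
```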

Traffic Sizing

As for traffic sizing, well...you would hope the cloud takes care of most of that. It does a great job with things like auto scale, but there are still cloud provider limits that affect sizing and machine instance types to keep in mind. Since I'm talking specifically about BIG-IP and NGINX products, those are considered network virtual appliances (NVA)...or in other words just another compute instance.

  1. Cloud providers each have their own limits
  2. Example = networking limits, VM limits, number of flows
  3. Google GCP VPC Resource Limits
  4. Azure VM Flow Limits
  5. AWS Instance Types
  6. What if your current usage shows 4 million concurrent connections?
  7. What if your current usage shows 30,000 new SSL TPS?
  8. What if your current usage shows 50Gbps of throughput?

The latter three items can lead to very different VM counts and sizes depending on cloud provider limits. Unfortunately, not all limits are documented. Key metrics for L7 proxies are typically SSL stats, throughput, connection type, and connection count. Collecting requirements around application behavior and traffic sizing can help you choose the instance size as well as instance count. We have a list of the F5 supported BIG-IP VE platforms on F5 CloudDocs.
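
To make that sizing exercise concrete, here is a quick back-of-the-envelope sketch. The per-instance capacities are placeholder assumptions; pull real numbers from your cloud provider's limit pages and the F5 BIG-IP VE / NGINX performance data for your chosen instance type.

```python
# Back-of-the-envelope sizing sketch. The per-instance capacities below are
# placeholder assumptions -- substitute real numbers from your cloud provider's
# limit pages and the F5/NGINX performance data for your instance type.
import math

observed_peak = {              # from your current usage stats
    "concurrent_connections": 4_000_000,
    "ssl_tps": 30_000,
    "throughput_gbps": 50,
}

per_instance_capacity = {      # hypothetical numbers for one VE on one VM size
    "concurrent_connections": 1_000_000,
    "ssl_tps": 8_000,
    "throughput_gbps": 10,
}

headroom = 1.25                # keep ~25% spare for failover and growth

instances_needed = max(
    math.ceil(observed_peak[k] * headroom / per_instance_capacity[k])
    for k in observed_peak
)
print(f"Plan for at least {instances_needed} instances (N+1 adds one more).")
```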

Next, we'll dive into the HA capabilities of the various F5 products and ways to deploy.

F5 Products and HA Capabilities

This section will cover the BIG-IP and NGINX capabilities for HA.

BIG-IP HA Capabilities

BIG-IP supports the following HA cluster configurations:

  1. Active/Active - all devices processing traffic
  2. Active/Standby - one device processes traffic, others wait in standby
  3. Configuration sync to all devices in cluster
  4. **Mirroring connections ONLY to next device in cluster
  5. L3/L4 connection info prevents service interruption upon failover (ex. avoids re-login)
  6. **Mirroring session state ONLY to next device in cluster
  7. L5-L7 state info to reach same backend server (ex. IP persistence, SSL persistence, iRule UIE persistence)

Reference: BIG-IP High Availability Docs

NGINX HA Capabilities

NGINX supports the following HA cluster configurations:

  1. Active/Active - all devices processing traffic
  2. Active/Standby - one device processes traffic, others wait in standby
  3. Configuration sync to all devices in cluster
  4. Mirroring connections at L3/L4 not available
  5. Mirroring session state to ALL devices in cluster using Zone Synchronization Module (NGINX Plus R15)
  6. Limited to sticky-learn session persistence, rate limiting info, and key-value stores

Reference: NGINX High Availability Docs

HA Methods for BIG-IP

In the following sections, I will illustrate 5 common deployment configurations for BIG-IP in public cloud.

  • HA for BIG-IP Design #1 - Active/Standby via API (single AZ)
  • HA for BIG-IP Design #2 - Active/Standby via API (multi AZ)
  • HA for BIG-IP Design #3 - A/A or A/S via LB (multi AZ)
  • HA for BIG-IP Design #4 - Auto Scale Active/Active via LB (multi AZ)
  • HA for BIG-IP Design #5 - Auto Scale Active/Active via DNS (multi AZ)

HA for BIG-IP Design #1 - Active/Standby via API (single AZ)

[diagram: Active/Standby via API, single AZ]

  • Cloud provider load balancer is NOT required
  • Failover time can be SLOW!
  • Only one device actively used (other device sits idle)
  • Use of single AZ deployments not recommended by cloud providers
  • Failover uses API calls to move cloud objects; times vary
    • AWS = 5-15 sec observed in testing, longer depending on other API activity
    • Google = 5-15 sec observed in testing, also depends on # of objects (ex. 200 sec for many forwarding rules)
    • Azure = 30-90 sec observed in testing, and I have seen 10 min or more

When failover methods use API calls, the results depend on how quickly the cloud provider processes the request and in what fashion (bulk vs. sequential). We use the F5 Cloud Failover Extension (CFE) for BIG-IP failover with the API method. I suggest you head over to the CFE page and take a look!
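
Purely to make the mechanism concrete, here is roughly what one API-driven failover step looks like on AWS: re-associating the Elastic IP that fronts the VIP onto the new active device's NIC. In practice CFE performs this for you; the IDs below are placeholders.

```python
# Rough illustration of "failover via API" on AWS: re-associate the Elastic IP
# serving the VIP onto the newly active device's NIC. In practice the F5 Cloud
# Failover Extension performs this; the IDs below are placeholders.
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

start = time.monotonic()
ec2.associate_address(
    AllocationId="eipalloc-0123456789abcdef0",    # Elastic IP serving the VIP
    NetworkInterfaceId="eni-0fedcba9876543210",   # NIC on the newly active BIG-IP
    AllowReassociation=True,                      # take it over from the old active
)
print(f"EIP re-association call returned in {time.monotonic() - start:.1f}s")
# The API call returning is not the whole story -- traffic only shifts once the
# provider finishes moving the mapping, which is why observed times vary.
```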

Key Findings:

  1. Google API failover times depend on number of forwarding rules
  2. Azure API extremely slow to disassociate/associate IPs to NICs (remapping)
  3. Azure API fast when updating routes (UDR, user defined routes)
  4. AWS seems reliable with API regarding IP moves and routes

Recommendations:

  1. We generally guide customers to not use a single AZ design for production
  2. Single AZ is useful for a single device, testing/dev
  3. Highly recommend considering a multi-AZ design (see next)

HA for BIG-IP Design #2 - Active/Standby via API (multi AZ)

[diagram: Active/Standby via API, multi AZ]

  • Similar pros/cons to HA in single AZ
  • Failover method uses API
  • Failover times can be slow
  • No cloud provider LB required

Key Difference:

  1. Multiple AZs are used
  2. Increased network/compute redundancy

Recommendations:

  1. This multi-AZ design is preferred over the single AZ design
  2. Recommend when a "traditional" HA cluster is required or for Lift-n-Shift...Rehost
  3. For Azure (based on my testing)...
    • Recommend against API-based IP failover due to long failover times
    • Recommend using Azure UDR (user defined routes) instead of IP failover when possible
    • Look at the failover via LB example instead for Azure
    • If the API method is required, look at DNS solutions to provide further redundancy
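
To make the Azure UDR recommendation concrete, here is a minimal sketch (assumed azure-mgmt-network SDK usage, with placeholder resource names) of pointing a route table's default route at the self IP of the newly active BIG-IP rather than re-mapping IPs between NICs. Again, CFE normally performs this step for you.

```python
# Sketch of the Azure UDR approach (assumed azure-mgmt-network SDK usage,
# placeholder resource names): point the route table's default route at the
# self IP of the newly active BIG-IP instead of re-mapping IPs between NICs.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = network.routes.begin_create_or_update(
    resource_group_name="rg-bigip-ha",
    route_table_name="rt-app-subnet",
    route_name="default-via-bigip",
    route_parameters={
        "address_prefix": "0.0.0.0/0",
        "next_hop_type": "VirtualAppliance",
        "next_hop_ip_address": "10.0.1.11",   # self IP of the newly active BIG-IP
    },
)
poller.result()   # route updates complete much faster than NIC IP re-mapping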

HA for BIG-IP Design #3 - A/A or A/S via LB (multi AZ)

[diagram: A/A or A/S via cloud LB, multi AZ]

  • Cloud LB health checks the BIG-IP for up/down status
  • Faster failover times (depends on cloud LB health timers)
  • Cloud LB allows A/A or A/S

Key difference:

  1. Multiple AZs are used
  2. Increased network/compute redundancy
  3. Cloud load balancer required

Recommendations:

  1. This multi-AZ design is preferred over the single AZ design
  2. Use "failover via LB" or auto scale if you require faster failover times
  3. This applies to all cloud providers
  4. For Google (based on my testing)...
    • Recommend against "via LB" for IPsec traffic (not supported by Google LB)
    • If load balancing IPsec, use the "via API" or "via DNS" failover methods
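
A quick rule-of-thumb sketch of why this method is faster: with "failover via LB" the detection time is governed by the LB health probe settings rather than API latency. The numbers below are examples only; each provider has its own allowed ranges and defaults.

```python
# Rule-of-thumb sketch: with "failover via LB", detection time is governed by
# the cloud LB health probe settings, not by API call latency. The numbers are
# examples only; each provider has its own allowed ranges and defaults.
probe_interval_s = 10        # how often the LB probes each BIG-IP instance
unhealthy_threshold = 3      # consecutive failed probes before marking it down

worst_case_detection_s = probe_interval_s * unhealthy_threshold
print(f"Traffic shifts to a healthy peer within roughly {worst_case_detection_s}s")
# Compare that with the 30-90s (or worse) seen with API-driven failover on some
# providers -- which is why the LB method wins when failover speed matters.
```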

HA for BIG-IP Design #4 - Auto Scale Active/Active via LB (multi AZ)

[diagram: Auto Scale Active/Active via cloud LB, multi AZ]

  • BIG-IP VE active/active and scales in/out
  • Traffic distributed to VEs by cloud LB
  • Cloud LB health checks the VEs
  • BIG-IP instances will scale in/out depending on thresholds
    • AWS = Auto Scaling groups
    • Azure = VM Scale Sets
    • Google = Target Instances and Target Pools
  • BIG-IP can auto heal (pave and nuke)
    • Cloud provider will launch a new instance when overall health fails

Key difference:

  1. Multiple AZs are used
  2. Cloud LB required (extra cost, extra hop)
  3. BIG-IP devices auto heal
  4. Rolling upgrades to BIG-IP
  5. BIG-IP devices launched in "single-NIC" mode, throughput max 1.5 Gbps

Recommendations:

  1. Auto scale is great when traffic stats are unknown or seasonal (shrink/grow)
  2. Recommended for apps fitting a migration type of Replatform or Refactor
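
To illustrate the "scale in/out depending on thresholds" idea, here is an AWS-flavored sketch using a target tracking policy. The F5 auto scale templates configure this for you; the group and policy names are placeholders, and Azure (VM Scale Sets) and Google have equivalent constructs.

```python
# AWS-flavored sketch of "scale in/out depending on thresholds" using a target
# tracking policy. The F5 auto scale templates set this up for you; group and
# policy names are placeholders, and Azure/Google have equivalent features.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="bigip-autoscale-group",
    PolicyName="scale-on-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,   # add VEs above ~60% average CPU, remove below it
    },
)
# Health-check failures lead to instance replacement by the scaling group --
# the "auto heal / pave and nuke" behavior described above.
```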

HA for BIG-IP Design #5 - Auto Scale Active/Active via DNS (multi AZ)

[diagram: Auto Scale Active/Active via DNS, multi AZ]

  • BIG-IP VE active/active and scales in/out
  • Traffic distributed to VEs by BIG-IP DNS (aka GTM)
  • BIG-IP DNS health checks the VEs
  • BIG-IP instances will scale in/out depending on thresholds
  • BIG-IP can auto heal (pave and nuke)

Key difference:

  1. Multiple AZs are used
  2. Cloud LB not required
  3. BIG-IP devices auto heal
  4. Rolling upgrades to BIG-IP
  5. BIG-IP devices launched in "single-NIC" mode, throughput max 1.5 Gbps
  6. DNS logic required by clients
  7. Each VIP on each BIG-IP will be an A record in the GTM pool

Recommendations:

  1. Good for apps that handle DNS resolution well upon failover events
  2. Recommend when cloud LB cannot handle particular protocol
  3. Recommend when customer is already using BIG-IP DNS (aka GTM)
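
To illustrate the "DNS logic required by clients" note above: each VIP is an A record behind the GTM wide-IP, so a well-behaved client tries the other answers (and re-resolves, honoring TTL) when the first one fails. A small client-side sketch, with a hypothetical hostname:

```python
# Client-side sketch of the "DNS logic required by clients" point: each VIP is
# an A record behind the GTM wide-IP, so the client should try the remaining
# answers (and re-resolve, honoring TTL) when the first one fails.
import socket

def connect_any(hostname, port=443, timeout=3.0):
    """Try every address returned for the name until one accepts a TCP connection."""
    last_error = None
    for *_, sockaddr in socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM):
        try:
            return socket.create_connection(sockaddr[:2], timeout=timeout)
        except OSError as exc:          # that VIP is down/unreachable; try the next
            last_error = exc
    raise ConnectionError(f"no reachable address for {hostname}") from last_error

# conn = connect_any("app.example.com")   # hypothetical GTM wide-IP
```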

HA Methods for NGINX

In the following sections, I will illustrate 2 common deployment configurations for NGINX in public cloud.

  • HA for NGINX Design #1 - Active/Standby via API (multi AZ)
  • HA for NGINX Design #2 - Auto Scale Active/Active via LB (multi AZ)

HA for NGINX Design #1 - Active/Standby via API (multi AZ)

[diagram: NGINX Active/Standby via API, multi AZ]

  • NGINX Plus required
  • Cloud provider load balancer is NOT required
  • Only one device actively used (other device sits idle)
  • Failover times dependent on cloud provider
    • AWS = 5-15 seconds observed
  • A/S via API not available in Google or Azure

Recommendations:

  1. Great when auto scale not needed
  2. Recommend when "traditional" HA cluster required or Lift-n-Shift...Rehost

Reference: Active-Passive HA for NGINX Plus on AWS

HA for NGINX Design #2 - Auto Scale Active/Active via LB (multi AZ)

[diagram: NGINX Auto Scale Active/Active via cloud LB, multi AZ]

  • NGINX Plus required
  • Cloud LB health checks the NGINX
  • Faster failover times

Key difference:

  1. Multiple AZs are used
  2. Increased network/compute redundancy
  3. Cloud load balancer required

Recommendations:

  1. Auto scale is great when traffic stats are unknown or seasonal (shrink/grow)
  2. Recommended for apps fitting a migration type of Replatform or Refactor

Reference: Active-Active HA for NGINX Plus on AWS, Active-Active HA for NGINX Plus on Google

Example Customer Scenario #1

As a means to make this topic a little more real, here is a common customer scenario that shows you the decisions that go into moving an application to the public cloud. Sometimes it's as easy as a lift-n-shift; other times you might need to do a little more work. In general, public cloud is not on-prem, and things might need some tweaking. Hopefully this example will give you some pointers and guidance on your next app migration to the cloud.

Current Setup:

  • Gaming applications
  • F5 hardware BIG-IP VIPRIONs on-prem
  • Two data centers for HA redundancy
  • iRule heavy configuration (TLS encryption/decryption, payload inspections)
  • Session Persistence = iRule Universal Persistence (UIE), and other methods
  • Biggest app:
    • 15K SSL TPS
    • 15Gbps throughput
    • 2 million concurrent connections
    • 300K HTTP req/sec (L7 with TLS)

Requirements for Successful Cloud Migration:

  1. Support current traffic numbers
  2. Support future target traffic growth
  3. Must run in multiple geographic regions
  4. Maintain session state
  5. Must retain all iRules in use

Recommended Design for Cloud Phase #1:

  • Migration Type: Hybrid model, on-prem + cloud, and some Rehost
  • Platform: BIG-IP
    • Retaining iRules means BIG-IP is required
  • Licensing: High Performance BIG-IP
    • Unlocks additional CPU cores past 8 (up to 24) for extra traffic and SSL processing
  • Instance type: check the F5 supported BIG-IP VE platforms for accelerated networking (10Gb+)
  • HA method: Active/Standby and multi-region with DNS
    • iRule Universal persistence mirrors only to the next device, so keep cluster size to 2
    • Scale horizontally via additional HA clusters and DNS
    • Clients pinned to a region via DNS (on-prem or public cloud)
    • Inside a region, the local proxy cluster shares state

This example comes up in customer conversations often. Based on customer requirements, in-house skillset, current operational model, and time frames, one option will stand out as better than the rest. A second design phase lends itself to more of a Replatform or Refactor migration type. In that case, more options can be leveraged to take advantage of cloud-native features. For example, changing the application persistence type from iRule UIE to cookie would allow BIG-IP to avoid keeping track of state. Why? With cookies, the client keeps track of that session state. The client receives a cookie, passes it back to the L7 proxy on subsequent requests, the proxy checks the cookie value, and traffic goes to the matching backend pool member. The requirement for the L7 proxy to share session state is now removed.
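
Here is a minimal sketch of that cookie idea (illustrative only; the cookie name, backend names, and addresses are made up and not a specific BIG-IP or NGINX cookie format): the client carries the routing hint, so no proxy in the pool needs mirrored state.

```python
# Minimal sketch of the phase-two idea: with cookie persistence the client
# carries the routing hint, so no proxy in the pool needs mirrored state.
# Cookie name, backend names, and addresses are illustrative only.
import random

BACKENDS = {"srv1": "10.0.2.10", "srv2": "10.0.2.11", "srv3": "10.0.2.12"}
COOKIE = "backend_id"

def handle_request(cookies):
    """Any proxy instance can serve this request; the 'state' lives in the cookie."""
    backend_id = cookies.get(COOKIE)
    if backend_id not in BACKENDS:                    # first visit (or stale cookie)
        backend_id = random.choice(list(BACKENDS))    # pick a pool member
    # ...forward the request to BACKENDS[backend_id] here...
    return BACKENDS[backend_id], {COOKIE: backend_id} # set/refresh the cookie

# The first request can land on any proxy; every later request that presents
# the cookie is pinned to the same backend, no matter which proxy receives it.
backend, cookies = handle_request({})
assert handle_request(cookies)[0] == backend
```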

Example Customer Scenario #2

Here is another customer scenario. This time the application is a full suite of multimedia content. In contrast to the first scenario, this one illustrates the benefits of rearchitecting various components, allowing greater flexibility when leveraging the cloud. You still must factor in in-house skill set, project time frames, and other important business (and application) requirements when deciding on the best migration type.

Current Setup:

  • Multimedia (Gaming, Movie, TV, Music) Platform
  • BIG-IP VIPRIONs using vCMP on-prem
  • Two data centers for HA redundancy
  • iRule heavy (Security, Traffic Manipulation, Performance)
  • Biggest App: OAuth + Cassandra for token storage (entitlements)

Requirements for Successful Cloud Migration:

  1. Support current traffic numbers
  2. Elastic auto scale for seasonal growth (ex. holidays)
  3. VPC peering with partners (must also bypass Web Application Firewall)
  4. Must support current or similar traffic manipulation in the data plane
  5. Compatibility with existing tooling used by Business

Recommended Design for Cloud Phase #1:

  • Migration Type: Repurchase, migrating from BIG-IP to NGINX Plus
  • Platform: NGINX
    • iRules converted to JavaScript or Lua
  • Licensing: NGINX Plus
    • Modules: GeoIP, Lua, JavaScript
  • HA method: N+1
    • Auto scaling via native LB
    • Active health checks

This is a great example of a Repurchase, in which the application's characteristics allow the various teams to explore alternative cloud migration approaches. This scenario describes a phase-one migration of converting BIG-IP devices to NGINX Plus devices. This example assumes the BIG-IP configurations can be fairly easily converted to NGINX Plus, and it also assumes there is available skillset and project time allocated to properly rearchitect the application where needed.

Summary

OK! Brains are expanding...hopefully? We learned about high availability and what that means for applications and the user experience. We touched on the importance of application behavior and traffic sizing. Then we explored the various F5 products, how they handle HA, HA designs, and my favorite...my own personal recommendations. These are of course my own recommendations, not official F5 recommendations; they are based on my own lab testing and interactions with customers. Every scenario will carry its own requirements, and all options should be carefully considered when leveraging the public cloud. Finally, we looked at customer scenarios, discussed requirements, and walked through design proposals. Fun!

Appendix

Read the following articles for more guidance specific to the various cloud providers. The information provided earlier is meant to be more general across all clouds.

AWS and BIG-IP: Advanced Topologies and More on Highly Available Services

Azure and BIG-IP: Lightboard Lessons - BIG-IP Deployments in Azure

Google and BIG-IP: Failing Faster in the Cloud

F5 CloudDocs: BIG-IP VE on Public Cloud

Google and NGINX Plus: High-Availability Load Balancing with NGINX Plus on Google Cloud Platform

AWS and NGINX Plus: Using AWS Quick Starts to Deploy NGINX Plus

Azure and NGINX Plus: NGINX on Azure

Comments
Jeff_Giroux, F5 Employee

This was a fun article. Reach out with any questions.

Ted_Byerly, F5 Employee

Great write up. Articles like this are great to explain all the options and design considerations.
