Customer Edge Site High Availability for Application Delivery - Reference Architecture
Purpose
This guide describes the reference architecture for deploying a highly available F5 Distributed Cloud (F5XC) Customer Edge (CE) site. It explains the networking options available to deploy a highly available multi-node CE site in an on-premise data center, branch location, or on the public cloud when deployed manually.
Audience
This guide is for technical readers, including NetOps and Solution Architect teams who want to better understand the various options for deploying a highly available F5 Distributed Cloud Customer Edge (CE) site.
The guide assumes the reader is familiar with basic networking concepts like routing protocols, DNS, and data center network architecture. Also, the reader must be aware of various F5XC concepts such as Load Balancing, BGP configuration, Sites and Virtual Sites, and Site Local Inside (SLI) and Site Local Outside (SLO) interfaces.
Introduction
To create a resilient network architecture, all components on the network must be deployed in a redundant topology to handle device and connectivity failures. A CE acts as an L7 gateway and sits in the path of the network traffic, hence it needs a redundant architecture. For a production setup, it is recommended to deploy the site as a three-node cluster. These three nodes are the control nodes. Additional worker nodes can be added for higher L7 and security performance.
Clustering on CE Site
A CE can be deployed as a multi-node site for redundancy and scaling performance. CE runs Kubernetes on its nodes and inherits the Kubernetes HA architecture of having either one or three control nodes and optional worker nodes. Production deployments are recommended to have three control nodes for redundancy and additional worker nodes to meet the performance requirements of the site. A multi-node site can tolerate one control node failure, as it needs at least two control nodes to form the quorum for HA. It is therefore important to ensure that multiple control nodes in a site do not fail simultaneously. Worker node failures do not cause the whole site to fail. They only reduce the total throughput the site can handle.
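The quorum arithmetic above can be sketched as follows. This is a minimal illustration that assumes the usual majority-quorum rule for a Kubernetes-style control plane; the function names are hypothetical and not part of any F5XC API.

```python
def quorum_size(control_nodes: int) -> int:
    """Smallest majority of the control nodes (assumed majority-quorum rule)."""
    return control_nodes // 2 + 1


def tolerated_failures(control_nodes: int) -> int:
    """Control nodes that can fail while a majority is still available."""
    return control_nodes - quorum_size(control_nodes)


# The two control-plane sizes mentioned in the text: one or three control nodes.
for nodes in (1, 3):
    print(f"{nodes} control node(s): quorum={quorum_size(nodes)}, "
          f"tolerated failures={tolerated_failures(nodes)}")
```

With three control nodes the quorum is two, so one failure is tolerated. A single control node tolerates none.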
Note: The control nodes may also be addressed as master nodes in legacy documentation. Although they are called control nodes, they run both the control plane and data plane functions.
Figure: CE Clustering
In a multi-node setup, two CE control nodes form tunnels to the two closest REs. If one of the control nodes with a tunnel fails, its tunnel is reassigned to the remaining control node. In a single-node site, the same node forms tunnels to two different REs. Worker nodes are not supported for sites with a single control node.
Figure: RE – CE connectivity
CE Site HA Options
In a regular deployment, a multi-node CE site is used to achieve redundancy. A load balancer configured on the CE site uses the IP address of SLI, SLO, or both interfaces as VIP by default. But this means the load balancer domain/hostname will need to resolve to multiple IP addresses across the nodes of the CE. To simplify this, F5XC also allows users to specify a custom IPv4 address as the VIP for each load balancer.
An alternative topology is to use multiple single-node sites deployed across different availability zones in the data center or public cloud. In this case, the sites can be grouped into a Virtual Site. A load balancer can be configured with a custom VIP advertised to this Virtual Site.
Both of these options are explained in detail in the sections below.
High Availability Options for Single CE Site
This section describes the deployment options available to direct the traffic across the CE nodes and lists the pros and cons of each option. The feasibility of these options may vary by environment (on-premise or public cloud) and networking tools available. These nuances are also explained for each option.
For L4 and L7 load balancer VIPs on the CE site, all nodes (control and worker nodes) can actively receive the traffic. The site bandwidth scales linearly with the number of nodes. So, multiple worker nodes can be deployed based on the performance requirements of the site.
For public load balancers, the VIP is on RE, and the bandwidth is limited to the bandwidth of the tunnel connecting the two CE control nodes to the REs.
Layer 3 Redundancy Using Static Routing With ECMP
This is the simplest way to configure redundancy for a load balancer VIP on the CE cluster. The application admin configures the LB with a user-specified VIP, and the network admin configures equal-cost static routes for this VIP address, with the SLI/SLO IP addresses of the CE nodes as the next hops. The router uses Equal Cost Multi-Path (ECMP) to spread the traffic across the CE nodes.
It is recommended to use consistent hashing ECMP configuration on the router to ensure an active session to a CE node is not rehashed in case another node fails.
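To illustrate why consistent hashing matters, the sketch below uses rendezvous (highest-random-weight) hashing as a stand-in for a router's consistent ECMP implementation. Real routers use their own vendor-specific algorithms; the node addresses, the flow tuples, and the function name are all assumptions made for the example.

```python
import hashlib


def pick_next_hop(flow: tuple, next_hops: list) -> str:
    """Choose a next hop for a flow using rendezvous hashing.

    Each (flow, hop) pair gets a pseudo-random score and the highest score wins.
    Removing a hop only affects flows for which that hop was the winner.
    """
    def score(hop: str) -> int:
        digest = hashlib.sha256(f"{flow}|{hop}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(next_hops, key=score)


# Hypothetical SLO addresses of three CE nodes and 200 client flows to one VIP.
nodes = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
flows = [(f"198.51.100.{i}", 40000 + i, "10.10.10.10", 443, "tcp")
         for i in range(1, 201)]

before = {flow: pick_next_hop(flow, nodes) for flow in flows}

# Simulate the failure of one node and recompute the mapping.
failed = "10.0.0.12"
survivors = [n for n in nodes if n != failed]
after = {flow: pick_next_hop(flow, survivors) for flow in flows}

moved = [flow for flow in flows if before[flow] != after[flow]]
print(f"{len(moved)} of {len(flows)} flows moved, all from the failed node:",
      all(before[flow] == failed for flow in moved))
```

Only the flows that were pinned to the failed node are remapped. With a plain modulo hash over the number of next hops, most flows would change nodes and active sessions on healthy nodes would break.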
Figure: Static Routing
Pros:
- VIP IP can be from any valid subnet. It is not restricted to the SLO or SLI subnet where it is advertised.
- Simple L3 routing configuration.
- Can scale with worker nodes with minimal route configuration change.
- All active nodes can receive the traffic.
Cons:
- Needs routing configuration changes external to F5XC, every time a new LB VIP is created/deleted.
- Traffic will get blackholed when a CE node fails, until the node’s route is removed from the route configuration, or the node is restored.
When To Use:
- When the NetOps team does not have access to routing devices with dynamic routing protocol capabilities like BGP.
- In use cases where the number of load balancers on the site is small and doesn’t change often, so the operational overhead of configuring and managing the routes is low.
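To make the operational overhead of the static-route option concrete, the sketch below generates one multipath route per VIP. It assumes a Linux router and emits iproute2-style commands purely for illustration; other routers use different syntax, and the VIP and node addresses are hypothetical.

```python
import ipaddress


def static_ecmp_routes(vips: list, next_hops: list) -> list:
    """Build one multipath /32 route per VIP, with every CE node as a next hop."""
    # Validate the inputs so a typo does not end up in the router configuration.
    for address in list(vips) + list(next_hops):
        ipaddress.IPv4Address(address)

    hops = " ".join(f"nexthop via {hop}" for hop in next_hops)
    return [f"ip route add {vip}/32 {hops}" for vip in vips]


# Hypothetical values: two LB VIPs and the SLO addresses of three CE nodes.
routes = static_ecmp_routes(
    vips=["10.10.10.10", "10.10.10.20"],
    next_hops=["192.168.1.11", "192.168.1.12", "192.168.1.13"],
)
for route in routes:
    print(route)
```

Every new VIP adds a route, and every node added or removed changes the next-hop list of all routes, which is why this option suits sites with few, stable load balancers.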
Layer 3 Redundancy Using BGP Routing With ECMP
BGP peering can be configured between the F5XC CE and the router. This configuration requires LBs to be created with user-specified VIPs. The CE advertises equal-cost /32 routes to the VIP with the SLO/SLI IP addresses as the next hops. The router uses Equal Cost Multi-Path (ECMP) to spread the traffic across the CE nodes.
It is recommended to use sticky/persistent ECMP configuration on the router to ensure an active session to a CE node is not rehashed to a different node in case of a node failure.
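The advertisement and withdrawal behavior can be modeled with a toy routing table. This is a simplified sketch, not a BGP implementation: it assumes each CE node advertises the VIP with its own interface address as the next hop and that losing a node removes its routes. The class and method names are hypothetical.

```python
class RouterRib:
    """Toy routing table: prefix -> set of equal-cost next hops."""

    def __init__(self):
        self.routes = {}

    def advertise(self, prefix: str, next_hop: str) -> None:
        """Record a route learned from a CE node."""
        self.routes.setdefault(prefix, set()).add(next_hop)

    def withdraw_peer(self, next_hop: str) -> None:
        """Drop every route learned from a peer whose session went down."""
        for prefix in list(self.routes):
            self.routes[prefix].discard(next_hop)
            if not self.routes[prefix]:
                del self.routes[prefix]

    def ecmp_next_hops(self, prefix: str) -> list:
        """Next hops the router would spread traffic across for this prefix."""
        return sorted(self.routes.get(prefix, ()))


rib = RouterRib()
vip = "10.10.10.10/32"  # hypothetical user-specified VIP
for node in ("192.168.1.11", "192.168.1.12", "192.168.1.13"):
    rib.advertise(vip, node)

print(rib.ecmp_next_hops(vip))   # all three nodes carry traffic
rib.withdraw_peer("192.168.1.12")
print(rib.ecmp_next_hops(vip))   # the failed node is no longer a next hop
```

Unlike static routes, no manual change is needed on the router when a node fails or a worker node is added.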
Note: Separate BGP peers must be configured for VIPs on SLO and SLI. Users can select the peer interface on CE while configuring the peers. For more information check BGP.
Figure: BGP Routing
Pros:
- VIP IP can be from any valid subnet. It is not restricted to the SLO or SLI subnet where it is advertised.
- Can automatically scale with worker nodes.
- Automatically withdraws the route for a failed CE node. Faster failover than any other method.
- All active nodes can receive the traffic.
Cons:
- Needs advanced network configuration on the router.
- The router must support BGP.
When To Use:
- The site has a large number of load balancers configured.
- Load balancers are frequently created and deleted.
- The application requires fast failover and minimal disruption in case of node failure.
- With network overlay technologies like Cisco ACI.
Layer 2 redundancy using VRRP/GARP
A user can enable VRRP on the CE site. This configuration requires LBs to be created with user-specified VIPs. Only control nodes participate in the VRRP redundancy group, and one of them is elected as the leader for each VIP. VIPs are placed randomly on different control nodes. Only the leader node for a VIP sends out Gratuitous ARP (GARP) broadcasts for that VIP. If the leader node fails, a new leader is elected and the VIP is placed on it.
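The placement and failover behavior described above can be sketched as a small simulation. It models the random placement stated in the text and does not reproduce the actual VRRP election protocol; the node names, VIPs, and function names are hypothetical.

```python
import random


def place_vips(vips: list, control_nodes: list, rng: random.Random) -> dict:
    """Assign each VIP a leader chosen at random among the control nodes."""
    return {vip: rng.choice(control_nodes) for vip in vips}


def fail_node(placement: dict, failed: str, control_nodes: list,
              rng: random.Random) -> dict:
    """Re-elect a leader only for VIPs whose leader was the failed node."""
    survivors = [node for node in control_nodes if node != failed]
    return {
        vip: rng.choice(survivors) if leader == failed else leader
        for vip, leader in placement.items()
    }


rng = random.Random(0)  # fixed seed so the run is repeatable
control_nodes = ["ce-node-1", "ce-node-2", "ce-node-3"]
vips = [f"10.10.10.{i}" for i in range(1, 7)]

placement = place_vips(vips, control_nodes, rng)
after_failure = fail_node(placement, "ce-node-1", control_nodes, rng)

for vip in vips:
    print(vip, placement[vip], "->", after_failure[vip])
```

Each VIP has exactly one leader at a time, so a single VIP never uses more than one node's capacity, and the spread of VIPs across nodes depends on chance.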
Figure: VRRP/GARP
Pros:
- No network configuration is required external to F5XC.
- Automatic failover of VIP when VRRP leader node fails.
Cons:
- Only control nodes can receive the traffic.
- Only one node is actively receiving traffic at a given time.
- VIPs will be placed on different control nodes randomly. Equal distribution of VIPs across control nodes is not guaranteed.
- Failover can be slow depending on the ARP resolution time on the network.
When To Use:
- The application team does not have access to routers, DNS servers, or load balancers on the network. (see other deployment options for details)
- The application does not require high throughput.
- Some traffic loss can be tolerated in case of node failure (e.g. for non-critical applications)
- The site is not in a public cloud. This option does not work in public cloud deployments because cloud networking blocks GARP requests.
External Proxy Load Balancing
A network admin can configure an external load balancer (LB), with the CE SLO/SLI IP addresses in its origin pool, to spread the traffic across CE nodes. This can be a TCP or HTTP load balancer.
For an external TCP LB, the client IP will be lost as the LB will SNAT the request before forwarding it to the CE nodes. F5XC does not support proxy protocol on the client side so it cannot be used to convey the client IP to the load balancer on the CE.
For an external HTTP LB, the traffic will still get SNAT-ed, but the client IP can be persisted to the CE nodes if the external LB can add the X-Forwarded-For header to the request.
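As an illustration of how an external HTTP LB can preserve the client IP, the sketch below appends the client address to the X-Forwarded-For header before the request is forwarded. It is a simplified model with a hypothetical function name; it treats headers as a plain dictionary with an exact-case key, whereas real HTTP header names are case-insensitive.

```python
def add_forwarded_for(headers: dict, client_ip: str) -> dict:
    """Return a copy of the headers with the client IP appended to X-Forwarded-For."""
    updated = dict(headers)
    existing = updated.get("X-Forwarded-For")
    # Append to an existing chain of proxies, or start a new one.
    updated["X-Forwarded-For"] = f"{existing}, {client_ip}" if existing else client_ip
    return updated


# Request arriving directly from a client (address is a documentation example).
first_hop = add_forwarded_for({"Host": "app.example.com"}, "203.0.113.7")
print(first_hop["X-Forwarded-For"])

# Request that already passed through another proxy.
second_hop = add_forwarded_for(first_hop, "198.51.100.20")
print(second_hop["X-Forwarded-For"])
```

The source address of the TCP connection seen by the CE is still the external LB's address; only the header carries the original client IP.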
If the LB on the CE site is an HTTPS LB or a TCP LB with TLS enabled, the external LB will have to host the TLS certificate, as it terminates the client TLS sessions. A wildcard certificate can be used to simplify this, but this may not always be a viable option for the applications.
In public cloud deployments, where L2 ARP and routing protocols may not work, users have another option in addition to TCP or HTTP load balancers: a Network Load Balancer on AWS or Google Cloud, a Standard Load Balancer on Azure, or a similar feature on other public clouds. These do not SNAT the traffic. They forward it to the CE nodes, much like a router running ECMP.
Note: For multi-node public cloud sites created using the F5XC console, F5XC automatically creates the required cloud-native LB. Manual CE deployment in the cloud is also supported, in which case the user must create the LB.
Figure: External LB
Pros:
- All active nodes can receive the traffic.
- Health probes can be configured to get F5XC LB health to avoid traffic blackholing.
- Can scale with worker nodes.
- Works for public cloud deployments.
Cons:
- Managing certificates on external LB can be operationally challenging for TLS traffic.
- No source IP retention in the case of TCP LB.
- Adds additional proxy hop.
- External LB can become a performance bottleneck even if CE is scaled out using worker nodes.
When To Use:
- There is an existing load balancer (usually in DMZ) in the traffic path, but CE is used for additional services like WAAP, DDoS, etc.
- In public clouds where cloud LBs can be used to load balance to the nodes.
DNS Load Balancing
Network admins can use DNS to resolve application hostnames to the SLI/SLO IP addresses on the CE Nodes. The DNS can be configured to respond with one IP at a time in a round robin manner. Alternatively, a private DNS LB or Global Server Load Balancer (GSLB) can be used which can make load-based intelligent decisions to distribute the traffic more evenly. User-specified VIPs must not be used in this case as the hostname must resolve to the individual node’s SLI/SLO IP address for the traffic to get routed to the node.
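The round-robin behavior can be sketched as a toy resolver that returns one node address per query. This models only the rotation and is not a DNS server; the hostname, addresses, and class name are hypothetical.

```python
from itertools import cycle


class RoundRobinResolver:
    """Toy resolver that answers each query with the next IP in rotation."""

    def __init__(self, records: dict):
        # One independent rotation per hostname.
        self._records = {name: cycle(ips) for name, ips in records.items()}

    def resolve(self, name: str) -> str:
        return next(self._records[name])


# Hypothetical hostname mapped to the SLO addresses of three CE nodes.
resolver = RoundRobinResolver({
    "app.example.com": ["192.168.1.11", "192.168.1.12", "192.168.1.13"],
})

answers = [resolver.resolve("app.example.com") for _ in range(4)]
print(answers)
```

The rotation is unaware of node health or load: a failed node keeps being returned until its record is removed, and clients that cache an answer stay on the same node until the TTL expires.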
Figure: DNS LB
Pros:
- Can be configured using existing DNS servers.
- Can scale with worker nodes.
- Works for public cloud deployments. (Not recommended as better options are available)
- Does not add an L4/L7 hop to the traffic path.
Cons:
- Needs DNS configuration changes external to F5XC, every time a new LB VIP is created/deleted, or a new worker node is added.
- Traffic will get blackholed when a CE node fails, until the node’s IP is removed from the DNS configuration, or the node is restored.
- Intelligent distribution of traffic requires GSLB which can be expensive.
- Subject to DNS cache and TTL causing clients to resolve to a down CE node.
When To Use:
- The application team only has access to GSLB or DNS server and does not want to limit the traffic to only one node at a time as in the case of the VRRP/GARP option.
- High performance is not a requirement, as multiple clients may resolve to the same node even if the site has multiple nodes.
- Can be used in the public cloud if the user does not want to create an external LB.
High Availability Using Multiple Single Node Sites Across Availability Zones
Instead of deploying a single multi-node site, customers can opt to deploy two (or more) single-node sites and use them together (as individual sites or grouped into a Virtual Site) to advertise a VIP.
This can be useful if the data center has two AZs, where it is more logical to deploy a CE in each AZ than to deploy a three-node CE with one node in one AZ and two nodes in the other. Upgrading one site at a time ensures that at least one site stays online to serve the traffic, providing resiliency against upgrade failures. This is very useful for critical applications demanding zero downtime.
All the deployment options above, other than the VRRP/GARP method, can be used in this case. It is recommended to use a consistent hash configuration for ECMP on the router to ensure all packets in a TCP session from a client are always routed to the same site.
In this deployment, each CE has two tunnels to the nearest REs. Hence, this method is also beneficial when you want to publish an app to the internet using F5XC Regional Edges (REs) as you can scale throughput by adding CEs and hence more tunnels.
Note: This is a big advantage of this topology over a multi-node site as the latter is limited to only two tunnels.
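The tunnel scaling described above comes down to simple arithmetic, sketched below. The two-tunnels-per-site figure is from the text; the per-tunnel bandwidth is a hypothetical placeholder, and the sketch assumes tunnel capacity adds up linearly, which real deployments may not achieve.

```python
TUNNELS_PER_SITE = 2  # per the text: each site forms two tunnels to the nearest REs


def re_tunnels(sites: int) -> int:
    """Total RE tunnels across a number of CE sites."""
    return sites * TUNNELS_PER_SITE


def aggregate_tunnel_bandwidth(sites: int, per_tunnel_gbps: float) -> float:
    """Upper bound on RE-to-CE bandwidth, assuming tunnels scale linearly."""
    return re_tunnels(sites) * per_tunnel_gbps


PER_TUNNEL_GBPS = 1.0  # hypothetical figure for illustration only

# One multi-node site versus three single-node sites.
for label, sites in (("one multi-node site", 1), ("three single-node sites", 3)):
    print(f"{label}: {re_tunnels(sites)} tunnels, "
          f"up to {aggregate_tunnel_bandwidth(sites, PER_TUNNEL_GBPS)} Gbps")
```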
For Public load balancers, the VIP is on the Regional Edges (REs) on the F5XC global network. The load balancing happens on the REs and the CEs provide secure connectivity with auto SNAT between REs and private origins. So, to get the most out of the available compute, the CEs in this case can be configured for Enhanced L3 performance mode as all the L7 processing happens on the RE.
Figure: RE-CE tunnels for multiple single-node site deployment
Conclusion
This guide should help the reader learn about the various HA options available in F5XC to make an informed decision on which method to choose based on their requirements and the networking tools available.
For a more detailed explanation of the above options with config examples also see: F5 Distributed Cloud – CE High Availability Options: A Comparative Exploration
Related Articles
F5XC Load Balancing and Distributed Proxy Concepts
F5 Distributed Cloud - Customer Edge Site - Deployment & Routing Options