Per-app failover for Kubernetes-based services using F5 Distributed Cloud Services

Summary

F5 Distributed Cloud Services (F5 XC) offers an alternative or complementary solution to Global Server Load Balancing (GSLB) for achieving High Availability (HA) or Disaster Recovery (DR). GSLB has been used for many years and is often employed when redundancy is required across multiple physical sites, although alternatives exist.

F5 XC connects disparate sites natively in a mesh, and can be used to achieve HA and DR without updating DNS or moving IP addresses. GSLB becomes optional in many cases. This article focuses on the use case of apps deployed on Kubernetes (K8s) clusters, but the concept applies to legacy and modern applications alike.

Background

A common question I hear is "I have a K8s cluster in Site A and an identical cluster in Site B. Can I fail over one app at a time?" The answer is yes, of course! To appreciate the detail, let's talk about K8s, how application delivery works, and the mechanics of what we're trying to achieve.

GSLB and load balancing across physical sites

While local load balancing often involves routing requests to multiple backend servers based on IP address, DNS-based failover is helpful when load balancing across disparate IP networks. A classic example uses F5's BIG-IP DNS solution, so this diagram is well-known to many seasoned network admins:
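
To make those mechanics concrete, here is a minimal sketch (in Python, with hypothetical site VIPs and a crude TCP health check) of the decision a GSLB makes: answer a DNS query with one healthy IP from a pool of site addresses. BIG-IP DNS implements this with wide IPs, pools, and monitors, so treat this only as a conceptual illustration.

```python
# A minimal sketch of the GSLB idea, not BIG-IP DNS itself: answer a DNS query
# with the first healthy member of a pool of site addresses. The VIPs below are
# hypothetical examples.
import socket

SITE_POOL = ["203.0.113.10", "198.51.100.10"]  # Site A VIP, Site B VIP (examples)

def site_is_healthy(ip: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Crude monitor: can we open a TCP connection to the site VIP?"""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def dns_answer() -> str:
    """Return one healthy site IP, mimicking a wide-IP style response."""
    for ip in SITE_POOL:
        if site_is_healthy(ip):
            return ip
    raise RuntimeError("no healthy sites to answer with")

if __name__ == "__main__":
    print(dns_answer())
```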

Alternatives to GSLB

Responding to a DNS request with a single IP from a pool of potential IP addresses is useful, but features like TTL and caching mean DNS updates aren't always immediate. One alternative is anycast routing, which allows a single IP address to exist in multiple locations, meaning you can achieve redundancy across physical locations without updating DNS. And if we remove network reachability from consideration for a moment, content-based routing (i.e., routing traffic based on things like HTTP headers) would allow us to route traffic to multiple backend locations, regardless of the IP address that a client targets. Both of these are possible with F5 XC.
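
To illustrate the content-based routing idea, here is a minimal sketch assuming hypothetical hostnames and pool members: the routing decision keys off an HTTP header (the Host header here), so one client-facing address can front many applications and many backend locations.

```python
# A minimal sketch of content-based routing: choose a backend pool from the HTTP
# Host header rather than the destination IP the client targeted. The hostnames
# and pool members below are hypothetical.
ORIGIN_POOLS = {
    "app1.example.com": ["10.1.0.10:8080", "10.2.0.10:8080"],  # cluster 1, cluster 2
    "app2.example.com": ["10.1.0.20:8080", "10.2.0.20:8080"],
}

def route(headers: dict) -> list:
    """Pick an origin pool based on the Host header, independent of the target IP."""
    host = headers.get("Host", "").lower()
    try:
        return ORIGIN_POOLS[host]
    except KeyError:
        raise ValueError(f"no route configured for host {host!r}")

if __name__ == "__main__":
    print(route({"Host": "app1.example.com"}))  # -> members of app1's pool
    print(route({"Host": "app2.example.com"}))  # -> members of app2's pool
```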

Why are we talking about Kubernetes?

I really do get that question I mentioned previously. Customers tend to ask me about "multi-cluster redundancy" first, and then they wonder if they can fail over a single application at a time between clusters. This assumes they have the same application deployed in disparate clusters in multiple data centers, which is often the case. (More modern practices exist but are out of scope for today).

I've often helped customers set up per-app failover for K8s apps between clusters using F5 Container Ingress Services (CIS), which requires BIG-IP LTM and DNS. This is a great fit for existing customers with these modules. This solution is well documented, with some great user guides and a video walk-through too. A visual overview: 

 

But here's an alternative idea: set up K8s service discovery in F5 XC to discover services in multiple K8s clusters. Create multiple origin pools, and then publish an HTTP Load Balancer to the public Internet, to multiple internal sites, or both. No DNS updates are required if an origin pool becomes unavailable, since your HTTP LB is published to sites directly, or to the Internet via an anycast IP address.
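
As a rough picture of what that discovery step looks like, here is a minimal sketch using the Python kubernetes client and two hypothetical kubeconfig files, one per cluster. F5 XC performs this discovery for you once credentials are configured; the sketch only shows the concept of enumerating the same Service in each cluster so each can back its own origin pool.

```python
# A minimal sketch of discovering Services in two clusters, assuming the
# "kubernetes" Python client and two hypothetical kubeconfig files.
from kubernetes import client, config

CLUSTERS = {
    "cluster-1": "kubeconfig-cluster1.yaml",  # hypothetical kubeconfig paths
    "cluster-2": "kubeconfig-cluster2.yaml",
}

def discover_services(namespace: str = "default") -> dict:
    """Return {cluster_name: [service names]} as candidate origin pool members."""
    discovered = {}
    for name, kubeconfig in CLUSTERS.items():
        api = client.CoreV1Api(config.new_client_from_config(config_file=kubeconfig))
        services = api.list_namespaced_service(namespace)
        discovered[name] = [svc.metadata.name for svc in services.items]
    return discovered

if __name__ == "__main__":
    for cluster, services in discover_services().items():
        print(f"{cluster}: {services}")
```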

(You could still implement external or internal GSLB to complement this. For example, you might prefer to advertise the public IP addresses of Customer Edge [CE] devices directly rather than use anycast via F5's Regional Edges, or you might plan for redundancy in case clients using an internal DNS record are pointing at a CE that fails.)

Here's a simple diagram to show multiple K8s clusters, where discovered services can populate origin pools in F5 XC. Remember, we can publish the HTTP LB to multiple "consumer" sites and/or to the Internet, and our two clusters could even be in the same XC site. The idea here is that we're using the F5 XC mesh to provide resiliency, rather than updating DNS or moving IP addresses ourselves.

Benefits of using F5 XC for publishing K8s services

As a more "cloud-native" approach to app delivery, I love F5 XC. In the simple example above, if the data center hosting Cluster 1 were to "die," no DNS updates would be needed (unlike with a GSLB approach); client requests would simply be routed to the remaining healthy cluster (a simple sketch of that selection logic follows the list below). In that sense, there are serious advantages to using F5 XC for multi-site redundancy:

  • No DNS updates at time of disaster, which removes potential complications like TTL values and caching.
  • No updating of IP addresses at time of disaster.
  • No extra configuration required by K8s staff (no CRDs or knowledge of failover required).
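
To ground that failover behavior, here is a minimal sketch of the selection logic, assuming two hypothetical origin pools with simple HTTP health endpoints. This is not how F5 XC implements health checking internally, just the shape of the decision: the client-facing address never changes, only which cluster's pool receives traffic.

```python
# A minimal sketch of per-app failover at the load balancer layer, assuming two
# hypothetical origin pools (one per cluster) behind one client-facing address.
# Only the pool selection changes at failover time; DNS and IPs stay put.
import urllib.request

ORIGIN_POOLS = {
    "cluster-1": "http://10.1.0.10:8080/healthz",  # hypothetical health endpoints
    "cluster-2": "http://10.2.0.10:8080/healthz",
}

def pool_is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Health check: does the pool's health endpoint return HTTP 200?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def select_pool() -> str:
    """Prefer cluster-1 and fail over to cluster-2 if its health check fails."""
    for name, health_url in ORIGIN_POOLS.items():
        if pool_is_healthy(health_url):
            return name
    raise RuntimeError("no healthy origin pools")

if __name__ == "__main__":
    print("routing new requests to", select_pool())
```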

However, it bears mentioning that F5 CIS is still a solid and preferred approach in many scenarios:

  • Traversing BIG-IP en route to a K8s cluster allows more granular traffic controls than are available via F5 XC (iRules, traffic policies, TCP profiles, etc.).
  • Cluster mode (directly targeting pods via CNI integration) is possible when using BIG-IP.
  • Strong enterprise adoption, scale, development team, and support.

Advanced Kubernetes hosting

While many customers are still learning K8s generally, more advanced teams may use hosted K8s solutions. F5 XC offers Virtual K8s, which supports centralized orchestration of applications across a fleet of sites (customer sites or F5 XC Regional Edges). That is a topic for another day, since this article focuses on customer-hosted K8s clusters, but it is a further step in the evolution toward more cloud-native application delivery, where planning for redundancy is largely shifted to the provider.

Conclusion

When architecting for multi-cluster HA or DR, consider the mechanics of how your application delivery works, your application architecture, and your organizational skill set. F5 XC can be used to "stitch together" disparate networks natively and, when coupled with native K8s service discovery, represents a strong solution for highly available delivery of K8s-based apps.

Updated Apr 11, 2023
Version 2.0