Multicloud Networking
68 TopicsF5 Distributed Cloud - Customer Edge Site - Deployment & Routing Options
F5 Distributed Cloud Customer Edge (CE) software deployment models for scale and routing for enterprises deploying multi-cloud infrastructure. Today's service delivery environments are comprised of multiple clouds in a hybrid cloud environment. How your multi-cloud solution attaches to your existing on-prem and cloud networks can be the difference between a successful overlay fabric, and one that leave you wanting more out of your solution. Learn your options with F5 Distributed Cloud Customer Edge software.14KViews19likes3CommentsKubernetes architecture options with F5 Distributed Cloud Services
Summary F5 Distributed Cloud Services (F5 XC) can both integrate with your existing Kubernetes (K8s) clusters and/or host a K8s workload itself. Within these distinctions, we have multiple architecture options. This article explores four major architectures in ascending order of sophistication and advantages. Architecture #1: External Load Balancer (Secure K8s Gateway) Architecture #2: CE as a pod (K8s site) Architecture #3: Managed Namespace (vK8s) Architecture #4: Managed K8s (mK8s) Kubernetes Architecture Options As K8s continues to grow, options for how we run K8s and integrate with existing K8s platforms continue to grow. F5 XC can both integrate with your existing K8s clusters and/or run a managed K8s platform itself. Multiple architectures exist within these offerings too, so I was thoroughly confused when I first heard about these possibilities. A colleague recently laid it out for me in a conversation: "Michael, listen up: XC can either integrate with your K8s platform, run inside your K8s platform, host virtual K8s (Namespace-aaS), or run a K8s platform in your environment." I replied, "That's great. Now I have a mental model for differentiating between architecture options." This article will overview these architectures and provide 101-level context: when, how, and why would you implement these options? Side note 1: F5 XC concepts and terms F5 XC is a global platform that can provide networking and app delivery services, as well as compute (K8s workloads). We call each of our global PoP's a Regional Edge (RE). RE's are highly meshed to form the backbone of the global platform. They connect your sites, they can expose your services to the Internet, and they can run workloads. This platform is extensible into your data center by running one or more XC Nodes in your network, also called a Customer Edge (CE). A CE is a compute node in your network that registers to our global control plane and is then managed by a customer as SaaS. The registration of one or more CE's creates a customer site in F5 XC. A CE can run on a hypervisor (VMWare/KVM/Etc), a Hyperscaler (AWS, Azure, GCP, etc), baremetal, or even as a k8s pod, and can be deployed in HA clusters. XC Mesh functionality provides connectivity between sites, security services, and observability. Optionally, in addition, XC App Stack functionality allows a large and arbitrary number of managed clusters to be logically grouped into a virtual site with a single K8s mgmt interface. So where Mesh services provide the networking, App Stack services provide the Kubernetes compute mgmt. Our first 2 architectures require Mesh services only, and our last two require App Stack. Side note 2: Service-to-service communication I'm often asked how to allow services between clusters to communicate with each other. This is possible and easy with XC. Each site can publish services to every other site, including K8s sites. This means that any K8s service can be reachable from other sites you choose. And this can be true in any of the architectures below, although more granular controls are possible with the more sophisticated architectures. I'll explore this common question more in a separate article. Architecture 1: External Load Balancer (Secure K8s Gateway) In a Secure Kubernetes Gateway architecture, you have integration with your existing K8s platform, using the XC node as the external load balancer for your K8s cluster. In this scenario, you create a ServiceAccount and kubeconfig file to configure XC. The XC node then performs service discovery against your K8s API server. I've covered this process in a previous article, but the advantage is that you can integrate with existing K8s platforms. This allows exposing both NodePort and ClusterIP services via the XC node. XC is not hosting any workloads in this architecture, but it is exposing your services to your local network, or remote sites, or the Internet. In the diagram above, I show a web application being accesssed from a remote site (and/or the Internet) where the origin pool is a NodePort service discovered in a K8s cluster. Architecture 2: Run a site within a K8s cluster (K8s site type) Creating a K8s site is easy - just deploy a single manifest found here. This file deploys multiple resources in your cluster, and together these resources work to provide the services of a CE, and create a customer site. I've heard this referred to as "running a CE inside of K8s" or "running your CE as a pod". However, when I say "CE node" I'm usually referring to a discreet compute node like a VM or piece of hardware; this architecture is actually a group of pods and related resources that run within K8s to create a XC customer site. With XC running inside your existing cluster, you can expose services within the cluster by DNS name because the site will resolve these from within the cluster. Your service can then be exposed anywhere by the F5 XC platform. This is similar to Architecture 1 above, but with this model, your site is simply a group of pods within K8s. An advantage here is the ability to expose services of other types (e.g. ClusterIP). A site deployed into a K8s cluster will only support Mesh functionality and does not support AppStack functionality (i.e., you cannot run a cluster within your cluster). In this architecture, XC acts as a K8s ingress controller with built-in application security. It also enables Mesh features, such as publishing of other sites' services on this site, and publishing of this site's discovered services on other sites. Architecture 3: vK8s (Namespace-as-a-Service) If the services you use include AppStack capabilities, then architectures #3 and #4 are possible for you. In these scenarios, our XC node actually runs your K8s on your workloads. We are no longer integrating XC with your existing K8s platform. XC is the platform. A simple way to run K8s workloads is to use a virtual k8s (vK8s) architecture. This could be referred to as a "managed Namespace" because by creating a vK8s object in XC you get a single namespace in a virtual cluster. Your Namespace can be fully hosted (deployed to RE's) or run on your VM's (CE's), or both. Your kubeconfig file will allow access to your Namespace via the hosted API server. Via your regular kubectl CLI (or via the web console) you can create/delete/manage K8s resources (Deployments, Services, Secrets, ServiceAccounts, etc) and view application resource metrics. This is great if you have workloads that you want to deploy to remote regions where you do not have infrastructure and would prefer to run in F5's RE's, or if you have disparate clusters across multiple sites and you'd like to manage multiple K8s clusters via a single centralized, virtual cluster. Best practice guard rails for vK8s With a vK8s architecture, you don't have your own cluster, but rather a managed Namespace. So there are some restrictions (for example, you cannot run a container as root, bind to a privileged port, or to the Host network). You cannot create CRD's, ClusterRoles, PodSecurityPolicies, or Namespaces, so K8s operators are not supported. In short, you don't have a managed cluster, but a managed Namespace on a virtual cluster. Architecture 4: mK8s (Managed K8s) In managed k8s (mk8s, also known as physical K8s or pk8s) deployment, we have an enterprise-level K8s distribution that is run at your site. This means you can use XC to deploy/manage/upgrade K8s infrastructure, but you manage the Kubernetes resources. The benefits include what is typical for 3rd-party K8s mgmt solutions, but also some key differentiators: multi-cloud, with automation for Azure, AWS, and GCP environments consumed by you as SaaS enterprise-level traffic control natively allows a large and arbitrary number of managed clusters to be logically managed with a single K8s mgmt interface You can enable kubectl access against your local cluster and disable the hosted API server, so your kubeconfig file can point to a global URL or a local endpoint on-prem. Another benefit of mK8s is that you are running a full K8s cluster at your site, not just a Namespace in a virtual cluster. The restrictions that apply to vK8s (see above) do not apply to mK8s, so you could run privileged pods if required, use Operators that make use of ClusterRoles and CRDs, and perform other tasks that require cluster-wide access. Traffic management controls with mK8s Because your workloads run in a cluster managed by XC, we can apply more sophisticated and native policies to K8s traffic than non-managed clusters in earlier architectures: Service isolation can be enforced within the cluster, so that pods in a given namespace cannot communicate with services outside of that namespace, by default. More service-to-service controls exist so that you can decide which services can reach with other services with more granularity. Egress control can be natively enforced for outbound traffic from the cluster, by namespace, labels, IP ranges, or other methods. E.g.: Svc A can reach myapi.example.com but no other Internet service. WAF policies, bot defense, L3/4 policies, etc—all of these policies that you have typically applied with network firewalls, WAF's, etc—can be applied natively within the platform. This architecture took me a long time to understand, and longer to fully appreciate. But once you have run your workloads natively on a managed K8s platform that is connected to a global backbone and capable of performing network and application delivery within the platform, the security and traffic mgmt benefits become very compelling. Conclusion: As K8s continues to expand, management solutions of your clusters make it possible to secure your K8s services, whether they are managed by XC or exist in disparate clusters. With F5 XC as a global platform consumed as a service—not a discreet installation managed by you—the available architectures here are unique and therefore can accommodate the diverse (and changing!) ways we see K8s run today. Related Articles Securely connecting Kubernetes Microservices with F5 Distributed Cloud Multi-cluster Multi-cloud Networking for K8s with F5 Distributed Cloud - Architecture Pattern Multiple Kubernetes Clusters and Path-Based Routing with F5 Distributed Cloud10KViews29likes5CommentsA complete Multi-Cloud Networking walkthrough with F5 Distributed Cloud
F5 Distributed Cloud – Multi-Cloud Networking F5 Distributed Cloud (F5 XC) provides a Software-as-a-Service based platform to connect, deliver, secure, and operate your networks and applications across any environment. This walkthrough contains two sections. The first section uses F5 Distributed Cloud Network Connect to network across cloud locations and providers with simplified provisioning and end-to-end security. The second part uses F5 Distributed Cloud App Connect, and shows how to securely connect distributed workloads across cloud and edge locations with integrated app security. Distributed Cloud Network Connect Network Connect helps customers establish a multi-cloud networking fabric with end-to-end cloud orchestration, a gateway that implements L3-L7 functions to enforce network connectivity and security and a unified policy with central visibility for collaboration across NetOps & SecOps. 1. Deploy F5 XC Customer Edge Site(s) Step 1: Establish a multi-cloud networking fabric by deploying F5 XC Customer Edge (CE) sites (cloud, edge, on-prem) ➡️ See the following article and connected video to learn how to use the Distributed Cloud Console to deploy a CE in AWS and in Azure, and then how to route traffic between each of the sites. Using F5 Distributed Cloud Network Connect to transit, route, & secure private cloud environments ➡️ F5 XC can orchestrate private connectivity, including AWS PrivateLink, Azure CloudLink, and many other private transport providers. The following article covers this capability in greater detail. Using F5 Distributed Cloud private connectivity orchestration for secure multi-cloud infrastructure Step 2: Customers onboard required VPC/VNets to the F5 XC CE sites to participate in the multi-cloud fabric. F5 XC then orchestrates cloud networking constructs to attract traffic from these VPCs (termed as spokes) and then enforce L3-L7 network services. Cloud orchestration includes things such as creating AWS TGW, route table updates, setting up Azure VNet peering, configuring AWS direct connect -or- Azure Express Route and related resources to establish private connectivity and many more. ➡️ See the following series of articles to learn how to use the Infrastructure as Code utility Terraform to deploy and connect Distributed Cloud CE’s in AWS, Azure, and Google Cloud Overview & AWS Deployment with F5 Distributed Cloud Multi-Cloud Networking AWS to Azure via Layer 3 & Global Network with F5 Distributed Cloud Multi-Cloud Networking Demo Guide: A step-by-step walkthrough using Terraform with Distributed Cloud Network Connect in AWS MCN 1: Deploy a F5 XC CE Site MCN 2: Cookie cutter architecture - fully orchestrated: attach spoke VPC/VNets seamlessly. MCN 3: Sites deployed across the globe to establish a multi-cloud networking fabric. 2. Configure Network Segments in Distributed Cloud Step 1: Configure Network Segments. These Network Segments will provide an end-to-end global isolated network. MCN 4: Configure a global Network Segment Step 2: Associate F5 XC CE Sites (incl. VLANs/interfaces for on-prem/edge sites), onboarded VPCs/VNets to these network segments to create an isolated network within the multi-cloud networking fabric. ➡️ Steps 4, 6, and 10+ in the following article show how to connect the Distributed Cloud Global Network use it to route traffic between different CE Sites Using F5 Distributed Cloud Network Connect to transit, route, & secure private cloud environments 3. Define Security Policies Step 1: Define security policies such as forward proxy policies, network security policies, traffic policers for your entire multi-cloud networking fabric with the power of labels to easily express the intent without complexities such as IP addresses. MCN 5: Enhanced Firewall Policy with the power of labels 4. Integrate with 3rd Party NFV services such as Palo Alto Networks Firewall Step 1: Seamlessly provision NFV services such as Big-IP AWAF, Palo Alto Networks Firewall, into any F5 XC CE site MCN 6: Orchestrate 3rd party firewalls like Palo Alto Step 2: Use the power of labels to easily express the intent to steer traffic to these 3rd party NFV appliances. MCN 7: Seamlessly steer traffic towards 3rd party NFV services such as PAN firewall ➡️ Learn how to deploy a Palo Alto Firewall using Distributed Cloud and a Palo Alto Panorama server, and then redirect traffic to the firewall using Enhanced Firewall Policies Easily Deploy Your Palo Alto NGFW with F5 Distributed Cloud Services 5. Monitor & Troubleshoot your Network NetOps and SecOps can collaborate using a single platform to monitor & troubleshoot networking issues across the multi-cloud fabric. MCN 8: Powerful monitoring dashboards & troubleshooting tools for your entire secure multi-cloud network fabric. Distributed Cloud App Connect App Connect helps customers simply deliver applications across their multi-cloud networking fabric including the internet without worrying about underlying networking via the distributed proxy architecture with full self-service capability and application isolation via namespaces. 1. Establish a Secure Multi-Cloud Network Fabric Utilize Multi-Cloud Network Connect to deploy F5 XC CE sites in environments that host your applications. 2. Discover Any App running Anywhere Step 1: Simply discover all apps running across your environments by configuring service discoveries. Use DNS based service discovery to discover legacy apps and K8s/consul-based service discovery to discover modern apps. MCN 9: Discover apps in any environment - sample showing apps discovered in a K8s cluster. 3. Deliver Any App Anywhere, incl. the Public Internet Step 1: Configure a Load Balancer which will connect apps (Origins) discovered in any environment and then deliver it (Advertise) to any environment. MCN 10: Leverage distributed proxy architecture to connect an App running in Azure to AWS – without configuring ANY networking. Step 2: Apps can be delivered (Advertised) directly to the internet using F5 XC’s performant anycast global backbone, with DNS delegation & TLS cert management by simply selecting VIP advertisement as ‘Internet’. MCN 11: Live traffic graph showing seamlessly connecting App in Azure -> AWS and then delivering the App in AWS to the public internet. ➡️ Navigate each step of the process, from deploying CE’s to using App Connect to connect app services locally and advertise the frontend to the Internet. The following collection of articles use the Distributed Cloud Console to facilitate the deployment, and demonstrate how to automate the process using the Infrastructure as Code utility Terraform to orchestrate everything. Use F5 Distributed Cloud to Connect Apps Running in Multiple Clusters and Sites Azure & Layer 7 Networking with F5 Distributed Cloud Multi-Cloud Networking Demo Guide: Using Terraform to connect backend-send services via Distributed Cloud App Connect in Azure 4. Secure your Apps Step 1: Secure Apps with industry leading application security services such as WAF, Bot, L7 DoS, API security, client-side defense and many more with a single click. MCN 12: One click application security for all your applications – anywhere ➡️ The following demo guide shows how to deploy web app globally and secure it. Distributed Cloud WAAP + CDN Demo Guide 5. Monitor & Troubleshoot your Apps SecOps, NetOps and DevOps can collaborate using a single platform to monitor & troubleshoot application issues across the multi-cloud fabric. MCN 13: Performance & Security dashboards for every application namespace - each namespace contains many load balancers. MCN 14: Performance & Security dashboard for each Load Balancer MCN 15: Various other security & performance tools to help maintain a healthy secure performant multi-cloud application fabric. Conclusion Using the Network Connect and App Connect services in Distributed Cloud, it's easy to deploy, connect, and secure apps that run in multiple clouds. The F5 platform automatically handles the connectivity, routing, and allows customized access, enabling apps to be deployed globally or privately in just a few clicks. Additional Resources Distributed Cloud Network Connect Distributed Cloud App Connect Demo Guide: F5 XC MCN7.3KViews4likes1CommentF5 Distributed Cloud - Regional Decryption with Virtual Sites
In this article we discuss how the F5 Distributed Cloud can be configured to support regulatory demands for TLS termination of traffic to specific regions around the world. The article provides insight into the F5 Distributed Cloud global backbone and application delivery network (ADN). The article goes on to inspect how the F5 Distriubted Cloud is able to achieve these custom topologies in a multi-tenant architecture while adhearing to the "rules of the internet" for route summarization. Read on to learn about the flexibility of F5's SaaS platform providing application delivery and security solutions for your applications.7KViews17likes2CommentsMulti-Cluster, Multi-Cloud Networking for K8S with F5 Distributed Cloud – Architecture Pattern
Application is the center of attention for majority organization. It is an important asset for business. Application evolves from simple application to complex application with multitude of integrated systems and distributed – simple to complex. Application becoming so complex. Complexities are the real threat to business regardless from security or operation aspect. Security, clouds, multi-cloud networking and so on exist because of application. Those technologies will cease to exist without existence of application. F5 Distributed Cloud is design to address business most important asset - application. It bring F5 vision to reality for our customer - “Secure, Deliver and Optimize every app and API anywhere”5.5KViews8likes3CommentsF5 Hybrid Security Architectures (Part 2 - F5's Distributed Cloud WAF and NGINX App Protect WAF)
Here in this example solution, we will be using Terraform to deploy an AWS Elastic Kubernetes Service cluster running the Arcadia Finance test web application serviced by F5 NGINX Kubernetes Ingress Controller and protected by NGINX App Protect WAF. We will supplement this with F5 Distributed Cloud Web App and API Protection to provide complimentary security at the edge. Everything will be tied together using GitHub Actions for CI/CD and Terraform Cloud to maintain state.5.2KViews4likes0CommentsUse F5 Distributed Cloud to Connect Apps Running in Multiple Clusters and Sites
Introduction Modern apps are comprised of many smaller components and can take advantage of today’s agile computing landscape. One of the challenges IT Admins and Security Operations face is securely controlling access to all the compofnents of distributed apps while business development grows or changes hands with mergers and acquisitions, or as contracts change. F5 Distributed Cloud (F5 XC) makes it very easy to provide uniform access to distributed apps regardless of where the components live. Solution Overview Arcadia Finance is a distributed app with modules that run in multiple Kubernetes clusters and in multiple locations. To expedite development in a key part of the Arcadia Finance distributed app, the business has decided to outsource work on the Refer A Friend module. IT Ops must now relocate the Refer A Friend module to a separate location exclusive to the new contractor where its team of developers have access to work on it. Because the app is modular, IT has shared a copy of the Refer A Friend container to the developer, and now that it is up and running in the new site, traffic to the module needs to transition away from the one that had been developed in house to the one now managed by the contractor. Logical Topology Distributed App Overview The Refer A Friend endpoint is called by the Arcadia Finance frontend pod in a Kubernetes (K8s) cluster when a user of the service wants to invite a friend to join. The pod does this by making an HTTP request to the location “refer-a-friend.demo.internal/app3/”. The endpoint “refer-a-friend.demo.internal” is registered as a discoverable service in the K8s cluster using an F5 XC App Connect HTTP Load Balancer with its VIP configured to be advertised internally to specific sites, including the K8s cluster. F5 XC uses the cluster’s K8s management API to register service names and make them available anywhere within a global network belonging to a customer's tenant. Three sites are used by the company that owns Arcadia Finance to deliver the distributed app. The core of the app lives in a K8s cluster in Azure, the administration and monitoring of the app is in the customer’s legacy site in AWS. To maintain security, the new contractor only has access to GCP where they’ll continue developing the Refer A Friend module. An F5 XC global virtual network connects all three sites, and all three sites are in a site mesh group to streamline communication between the different app modules. Steps to deploy To reach the app externally, an App Connect HTTP Load Balancer policy is configured using an origin pool that connects to the K8s “frontend” service, and the origin pool uses a Kubernetes Site in F5 XC to access the frontend service. A second HTTP Load Balancer policy is configured with its origin pool, a static IP that lives in Azure and is accessed via a registered Azure VNET Site. When the Refer A Friend module is needed, a pod in the K8s cluster connects to the Refer A Friend internal VIP advertised by the HTTP Load Balancer policy. This connection is then tunneled by F5 XC to an endpoint where the module runs. With development to the Refer A Friend module turned over to the contractor, we only need to change the HTTP Load Balancer policy to use an origin pool located in the contractor’s Cloud GCP VPC Site. The origin policy for the GCP located module is nearly identical to the one used in Azure. Now when a user in the Arcadia App goes to refer a friend, the callout the app makes is now routed to the new location where it is managed and run by the new contractor. Demo Watch the following video for information about this solution and a walkthrough using the steps above in the F5 Distributed Cloud Console. Conclusion Using Distributed Cloud with modern day distributed apps, it’s almost too easy to route requests intended for a specific module to a new location regardless of the provider and provider specific requirements or the IP space the new module runs in. This is the true power of using Distributed Cloud to glue together modern day distributed apps.4.2KViews4likes0CommentsUnderstanding Modern Application Architecture - Part 1
This is part 1 of a series. Here are the other parts: Understanding Modern Application Architecture - Part 2 Understanding Modern Application Architecture - Part 3 Over the past decade, there has been a change taking place in how applications are built. As applications become more expansive in capabilities and more critical to how a business operates, (or in many cases, the application is the business itself) a new style of architecture has allowed for increased scalability, portability, resiliency, and agility. To support the goals of a modern application, the surrounding infrastructure has had to evolve as well. Platforms like Kubernetes have played a big role in unlocking the potential of modern applications and is a new paradigm in itself for how infrastructure is managed and served. To help our community transition the skillset they've built to deal with monolithic applications, we've put together a series of videos to drive home concepts around modern applications. This article highlights some of the details found within the video series. In these first three videos, we breakdown the definition of a Modern Application. One might think that by name only, a modern application is simply an application that is current. But we're actually speaking in comparison to a monolithic application. Monolithic applications are made up of a single, or a just few pieces. They are rigid in how they are deployed and fragile in their dependencies. Modern applications will instead incorporate microservices. Where a monolithic application might have all functions built into one broad encompassing service, microservices will break down the service into smaller functions that can be worked on separately. A modern application will also incorporate 4 main pillars. Scalability ensures that the application can handle the needs of a growing user base, both for surges as well as long term growth. Portability ensures that the application can be transportable from its underlying environment while still maintaining all of its functionality and management plane capabilities. Resiliency ensures that failures within the system go unnoticed or pose minimal disruption to users of the application. Agility ensures that the application can accommodate for rapid changes whether that be to code or to infrastructure. There are also 6 design principles of a modern application. Being agnostic will allow the application to have freedom to run on any platform. Leveraging open source software where it makes sense can often allow you to move quickly with an application but later be able to adopt commercial versions of that software when full support is needed. Defining by code allows for more uniformity of configuration and move away rigid interfaces that require specialized knowledge. Automated CI/CD processes ensures the quick integration and deployment of code so that improvements are constantly happening while any failures are minimized and contained. Secure development ensures that application security is integrated into the development process and code is tested thoroughly before being deployed into production. Distributed Storage and Infrastructure ensures that applications are not bound by any physical limitations and components can be located where they make the most sense. These videos should help set the foundation for what a modern application is. The next videos in the series will start to define the fundamental technical components for the platforms that bring together a modern application. Continued in Part 24.1KViews8likes0CommentsWhat is Multi-Cloud Networking?
What is Multi-Cloud Networking? Multi-cloud networking (MCN), as a technology, aims to provide easy network connectivity between cloud environments. For the purpose of our definition, we need to imagine our datacenter as a cloud. You can loosely define a cloud environment as 'anywhere you run workloads.' The concept is nebulous… literally. Clouds come in all shapes and sizes, from 'running on Pi' to AWS / GCP / Azure. MCN is to clouds, as Internet is to network. AWS’s DirectConnect, Azure ExpressRoute and GCP’s Direct Link were early forms of MCN, aimed at joining portions of their own clouds together with customer datacenters. Insertion of transport virtual appliances in clouds has become another mechanism for MCN through time. Its strength is its flexibility and agility. One other notable MCN concept is the transport provider. Some circuit providers offer 'short-hop' transport to various cloud providers by routing. This option offers significant throughput versus the SDN router but lacks the agility. This is a popular option for hybrid cloud enterprises. With all of these options, you can make individual connections to each cloud, potentially in a hub and spoke fashion or full mesh. Challenges With Multi-Cloud Networking The top-most concern should be scalability, in every way. You need to be concerned about scale in routing, licensing, metered cloud costs, not to mention the knowledge to understand all of the nuance features of each cloud provider and so many more things. All of this is operational overhead, which can be significant. Another serious challenge is IP addressing. The sheer volume of it is one thing. Anyone who works with modern applications today can tell you that it's hard to even find a workload sometimes, with how massive things can get. DNS is one possible option to assist, but you've got to account for all of the native cloud workloads, too.. with their different DNS interfaces. Another common challenge is IP overlap. If you're curious what I mean, lets say your employer acquires a piece of software that lives in GCP, but you're already in AWS. You start going down the path of routing when you suddenly notice that both cloud environments are 10.1.x.x/16. This means localized routing all over the place and we know how much router people love one-offs, am I right? The next challenge is one I've already hinted at: How many indepth nerd knobs do you want to know by how many security vendors? You've got to strategize to minimize this sort of potential sprawl and standardize on the vendors that can do the most for you. Advantages of Multi-Cloud Networking The greatest advantage is really multi-cloud transit. Understanding so many different and new technologies is a daunting task. With multi-cloud transit, data centers route through the same SDN routers as your cloud application flows, allowing you to see each cloud provider as a metered resource for app consumption. No need to worry about addressing, DNS, or routing for each environment. Another substantial benefit is the enablement of a shared security model. When you can route between these environments, you can also easily aggregate logs, integrate with SIEMs and manage automated security policies with ease. Network fluidity is another substantial benefit. When your COO comes to you and says that you need to integrate a newly acquired network segment, you have no problems. One of the very cool benefits of SDN is the ability to route by software object. When we think of routing in traditional networks we want our packet to get to 10.10.10.4 by way of 192.168.3.1, but an SDN router sends our packet to 10.10.10.4 by way of f5xc_gcp_router4. This also means that your app developer can stamp out their app in AWS to send another packet to 10.10.10.4 by way of f5xc_aws_router16 or such. Overlap no longer matters when you route through an SDN core. Conclusion Giving your modern application networks the flexibility to grow on demand, to assimilate new application network segments in minutes instead of months... Ultimately, I really believe that MCN - when done right - like Chuck Mangione said (well, with a flugel horn), 'Feels So Good.' The designs you can build with it are SO much more scalable and translate everything from physical data centers to clouds in a clean, easy to manage fashion.4KViews8likes0CommentsF5 Distributed Cloud – CE High Availability Options: A Comparative Exploration
This article explores an alternative approach to achieve HA across single CE nodes, catering for use cases requiring higher performance and granular control over redundancy and failover management. Introduction F5 Distributed Cloud offers different techniques to achieve High Availability (HA) for Customer Edge (CE) nodes in an active-active configuration to provide redundancy, scaling on-demand and simplify management. By default, F5 Distributed Cloud uses a method for clustering CE nodes, in which CEs keep track of peers by sending heartbeats and facilitating traffic exchange among themselves. This method also handles the automatic transfer of traffic, virtual IPs, and services between CE peers —excellent for simplified deployment and running App Stack sites hosting Kubernetes workloads. However, if CE nodes are deployed mainly to manage L3/L7 traffic and application security, this default model might lack the flexibility needed for certain scenarios. Many of our customers tell us that achieving high availability is not so straightforward with the current clustering model. These customers often have a lot of experience in managing redundancy and high availability across traditional network devices. They like to manage everything themselves—from scheduling when to switch over to a redundant pair (planned failover), to choosing how many network paths (tunnels) to use between CEs to REs (Regional Edges) or other CEs. They also want to handle any issues device by device, decide the number of CE nodes in a redundancy group, and be able to direct traffic to different CEs when one is being updated. Their feedback inspired us to write this article, where we explore a different approach to achieve high availability across CEs. The default clustering model is explained in this document: https://docs.cloud.f5.com/docs/ves-concepts/site#cluster-of-nodes Throughout this article, we will dive into several key areas: An overview of the default CE clustering model, highlighting its inherent challenges and advantages. Introduction to an alternative clustering strategy: Single Node Clustering, including: An analysis of its challenges and benefits. Identification of scenarios where this approach is most applicable. A guide to the configuration steps necessary to implement this model. An exploration of failover behavior within this framework. A comparison table showing how this new method differs from the default clustering method. By the end of this article, readers will gain an understanding of both clustering approaches, enabling informed decisions on the optimal strategy for their specific needs. Default CE Clustering Overview In a standard CE clustering setup, a cluster must have at least three Master nodes, with subsequent additions acting as Worker nodes. A CE cluster is configured as a "Site," centralizing operations like pool configuration and software upgrades to simplify management. In this clustering method, frequent communication is required between control plane components of the nodes on a low latency network. When a failover happens, the VIPs and services - including customer’s compute workloads - will transition to the other active nodes. As shown in the picture above, a CE cluster is treated as a single site, regardless of the number of nodes it contains. In a Mesh Group scenario, each mesh link is associated with one single tunnel connected to the cluster. These tunnels are distributed among the master nodes in the cluster, optimizing the total number of tunnels required for a large-scale Mesh Group. It also means that the site will be connected to REs only via 2 tunnels – one to each RE. Design Considerations for Default CE Clustering model: Best suited for: 1- App Stack Sites: Running Kubernetes workloads necessitates the default clustering method for container orchestration across nodes. 2- Large-scale Site-Mesh Groups (SMG) 3- Cluster-wide upgrade preference: Customers who favour managing nodes collectively will find cluster-wide upgrades more convenient, however without control over the upgrade sequence of individual nodes. Challenges: o Network Bottleneck for Ingress Traffic: A cluster connected to two Regional Edge (RE) sites via only 2 tunnels can lead to only two nodes processing external (ingress) traffic, limiting the use of additional nodes to process internal traffic only. o Three-master node requirement: Some customers are accustomed to dual-node HA models and may find the requirement for three master nodes resource-intensive. o Hitless upgrades: Controlled, phased upgrades are preferred by some customers for testing before widespread deployment, which is challenging with cluster-wide upgrades. o Cross-site deployments: High network latency between remote data centers can impact cluster performance due to the latency sensitivity of etcd daemon, the backbone of cluster state management. If the network connection across the nodes gets disconnected, all nodes will most likely stop the operation due to the quorum requirements of etcd. Therefore, F5 recommends deploying separate clusters for different physical sites. o Service Fault Sprawl and limited Node fault tolerance: Default clusters can sometimes experience a cascading effect where a fault in a node spreads throughout the cluster. Additionally, a standard 3-node cluster can generally only tolerate the failure of two nodes. If the cluster was originally configured with three nodes, functionality may be lost if reduced to a single active node. These limitations stem from the underlying clustering design and its dependency on etcd for maintaining cluster state. The Alternative Solution: HA Between Multiple Single Nodes The good news is that we can achieve the key objectives of the clustering – which are streamlined management and high availability - without the dependency on the control plane clustering mechanisms. Streamlined management using “Virtual Site”: F5 Distributed Cloud provides a mechanism called “Virtual Site” to perform operations on a group of sites (site = node or cluster of nodes), reducing the need to repeat the same set of operations for each site. The “Virtual Site” acts as an abstraction layer, grouping nodes tagged with a unique label and allows collectively addressing these nodes as a single entity. Configuration of origin pools and load balancers can reference Virtual Sites instead of individual sites/nodes, to facilitate cluster-like management for two or more nodes and enabling controlled day 2 operations. When a node is disassociated from Virtual Site by removing the label, it's no longer eligible for new connections, and its listeners are simultaneously deactivated. Upgrading nodes is streamlined: simply remove the node's label to exclude it from the Virtual Site, perform the upgrade, and then reapply the label once the node is operational again. This procedure offers you a controlled failover process, ensuring minimal disruption and enhanced manageability by minimizing the blast radius and limiting the cope of downtime. As traffic is rerouted to other CEs, if something goes wrong with an upgrade of a CE node, the services will not be impacted. HA/Redundancy across multiple nodes: Each single node in a Virtual Site connect to dual REs through IPSec or SSL/TLS tunnels, ensuring even load distribution and true active-active redundancy. External (Ingress) Traffic: In the Virtual Site model, the Regional Edges (REs) distribute external traffic evenly across all nodes. This contrasts with the default clustering approach where only two CE nodes are actively connected to the REs. The main Virtual Site advantage lies in its true active/active configuration for CEs, increasing the total ingress traffic capacity. If a node becomes unavailable, the REs will automatically reroute the new connections to another operational node within the Virtual Site, and the services (connection to origin pools) remain uninterrupted. Internal (East-West) Traffic: For managing internal traffic within a single CE node in a Virtual Site (for example, when LB objects are configured to be advertised within the local site), all network techniques applicable to the default clustering model can be employed in this model as well, except for the Layer 2 attachment (VRRP) method. Preferred load distribution method for internal traffic across CEs: Our preferred methods for load balancing across CE nodes are either DNS based load balancing or Equal-Cost Multi-Path (ECMP) routing utilizing BGP for redundancy. DNS Load Balancer Behavior: If a node is detached from a Virtual Site, its associated listeners and Virtual IPs (VIPs) are automatically withdrawn. Consequently, the DNS load balancer's health checks will mark those VIPs as down and prevent them from receiving internal network traffic. Current limitation for custom VIP and BGP: When using BGP, please note a current limitation that prevents configuring a custom VIP address on the Virtual Site. As a workaround, custom VIPs should be advertised on individual sites instead. The F5 product team is actively working to address this gap. For a detailed exploration of traffic routing options to CEs, please refer to the following article here: https://community.f5.com/kb/technicalarticles/f5-distributed-cloud---customer-edge-site---deployment--routing-options/319435 Design Considerations for Single Node HA Model: Best suited for: 1- Customers with high throughput requirement: This clustering model ensures that all Customer Edge (CE) nodes are engaged in managing ingress traffic from Regional Edges (REs), which allows for scalable expansion by adding additional CEs as required. In contrast, the default clustering model limits ingress traffic processing to only two CE nodes per cluster, and more precisely, to a single node from each RE, regardless of the number of worker nodes in the cluster. Consequently, this model is more advantageous for customers who have high throughput demands. 2- Customers who prefer to use controlled failover and software upgrades This clustering model enables a sequential upgrade process, where nodes are updated individually to ensure each node upgrades successfully before moving on to the other nodes. The process involves detaching the node from the cluster by removing its site label, which causes redirecting traffic to the remaining nodes during the upgrade. Once upgraded, the label is reapplied, and this process is repeated for each node in turn. This is a model that customers have known for 20+ years for upgrade procedures, with a little wrinkle with the label. 3- Customers who prefer to distribute the load across remote sites Nodes are deployed independently and do not require inter-node heartbeat communication, unlike the default clustering method. This independence allows for their deployment across various data centers and availability zones while being managed as a single entity. They are compatible with both Layer 2 (L2) spanned and Layer 3 (L3) spanned data centers, where nodes in different L3 networks utilize distinct gateways. As long as the nodes can access the origin pools, they can be integrated into the same "Virtual Site". This flexibility caters to customers' traditional preferences, such as deploying two CE nodes per location, which is fully supported by this clustering model. Challenges: Lack of VRRP Support: The primary limitation of this clustering method is the absence of VRRP support for internal VIPs. However, there are some alternative methods to distribute internal traffic across CE nodes. These include DNS based routing, BGP with Equal-Cost Multi-Path (ECMP) routing, or the implementation of CEs behind another Layer 4 (L4) load balancer capable of traffic distribution without source address alteration, such as F5 BIG-IPs or the standard load balancers provided by Azure or AWS. Limitation on Custom VIP IP Support: Currently, the F5 Distributed Cloud Console has a restriction preventing the configuration of custom virtual IPs for load balancer advertisements on Virtual Sites. We anticipate this limitation will be addressed in future updates to the F5 Distributed Cloud platform. As a temporary solution, you can advertise the LB across multiple individual sites within the Virtual Site. This approach enables the configuration of custom VIPs on those sites. Requires extra steps for upgrading nodes Unlike the Default clustering model where upgrades can be performed collectively on a group of nodes, this clustering model requires upgrading nodes on an individual basis. This may introduce more steps, especially in larger clusters, but it remains significantly simpler than traditional network device upgrades. Large-Scale Mesh Group: In F5 Distributed Cloud, the "Mesh Group" feature allows for direct connections between sites (whether individual CE sites or clusters of CEs) and other selected sites through IPSec tunnels. For CE clusters, tunnels are established on a per-cluster basis. However, for single-node sites, each node creates its own tunnels to connect with remote CEs. This setup can lead to an increased number of tunnels needed to establish the mesh. For example, in a network of 10 sites configured with dual-CE Virtual Sites, each CE is required to establish 18 IPSec tunnels to connect with other sites, or 19 for a full mesh configuration. Comparatively, a 10-site network using the default clustering method—with a minimum of 3 CEs per site—would only need up to 9 tunnels from each CE for full connectivity. Opting for Virtual Sites with dual CEs, a common choice, effectively doubles the number of required tunnels from each CE when compared to the default clustering setup. However, despite this increase in tunnels, opting for a Mesh configuration with single-node clusters can offer advantages in terms of performance and load distribution. Note: Use DC Groups as an alternative solution to Secure Mesh Group for CE connectivity: For customers with existing private connectivity between their CE nodes, running Site Mesh Group (SMG) with numerous IPsec tunnels can be less optimal. As a more scalable alternative for these customers, we recommend using DC Cluster Group (DCG). This method utilizes IP-in-IP tunnels over the existing private network, eliminating the need for individual encrypted IPsec tunnels between each node and streamlining communication between CE nodes via IP-n-IP encapsulations. Configuration Steps The configuration for creating single node clusters involves the following steps: Creating a Label Creating a Virtual Site Applying the label to the CE nodes (sites) Review and validate the configuration The detailed configuration guide for the above steps can be found here: https://docs.cloud.f5.com/docs/how-to/fleets-vsites/create-virtual-site Example Configuration: In this example, you can create a label called "my-vsite" to group CE nodes that belong to the same Virtual Site. Within this label, you can then define different values to represent different environments or clusters, such as specific Azure region or an on-premise data center. Then a Virtual Site of “CE” type can be created to represent the CE cluster in “Azure-AustraliaEast-vSite" and tied to any CE that is tagged with the label “my-vsite=Azure-AustraliaEast-vSite”: Now, any CE node that should join the cluster (Virtual Site), should get this label: Verification: To confirm the Virtual Site configuration is functioning as intended, we joined two CEs (k1-azure-ce2 and k1-azure-ce03) into the Virtual Site and evaluated the routing and load balancing behavior. Test 1: Public Load Balancer (Virtual Site referenced in the pool) The diagram shows a public "Load Balancer" advertised on the RE referencing a pool that uses the newly created Virtual Site to access the private application: As shown below, the pool member was configured to be accessed through the Virtual Site: Analysis of the request logs in the Performance dashboard confirmed that all requests to the public website were evenly distributed across both CEs. Test 2: Internal Load Balancer (LB advertised on the Virtual Site) We deployed an internal Load Balancer and advertised it on the newly created Virtual Site, utilizing the pool that also references the same Virtual Site (k1-azure-ce2 and k1-azure-ce03). As shown below, the Load Balancer was configured to be advertised on the Virtual Site. Note: Here we couldn't use a "shared" custom VIP across the Virtual Site due to a current platform constraint. If a custom VIP is required, we should use "site" as opposed to "Virtual Site" and advertise the Load Balancer on all sites, like below picture: Request logs revealed that when traffic reached either CE node within the Virtual Site, the request was processed and forwarded locally to the pool member. In the example below: src_site: Indicates the CE (k1-azure-ce2) that processed the request. src_ip: Represents the client's source IP address (192.168.1.68). dst_site: Indicates the CE (k1-azure-ce2) from which the pool member is accessed. dst_ip: Represents the IP address of the pool member (192.168.1.6). Resilience Testing: To assess the Virtual Site's resilience, we intentionally blocked network access from k1-azure-ce2 CE to the pool member (192.168.1.6). The CE automatically rerouted traffic to the pool member via the other CE (k1-azure-ce03) in the Virtual Site. Note: By default, CEs can communicate with each other via the F5 Global Network. This can be customized to use direct connectivity through tunnels if the CEs are members of the same DC Cluster Group (IP-n-IP tunneling) or Secure Mesh Groups (IPSec tunneling). The following picture shows the traffic flow via F5 Global Network. The following picture shows the traffic flow via the IP-n-IP tunnel when a DC Clustering Group (DCG) is configured across the CE nodes. Failover Behaviour When a CE node is tied to a Virtual Site, all internal Load Balancers (VIPs) advertised on that Virtual Site will be deployed in the CE. Additionally, the Regional Edge (RE) begins to use this node as one of the potential next hops for connections to the origin pool. Should the CE become unavailable, or if it lacks the necessary network access to the origin server, the RE will almost seamlessly reroute connections through the other operational CEs in the Virtual Site. Uncontrolled Failover: During instances of uncontrolled failover, such as when a node is unexpectedly shut down from the hypervisor, we have observed a handful of new connections experiencing timeouts. However, these issues were resolved by implementing health checks within the origin pool, which prevented any subsequent connection drops. Note: Irrespective of the clustering model in use, it's always recommended to configure health checks for the origin pool. This practice enhances failover responsiveness and mitigates any additional latency incurred during traffic rerouting. Controlled Failover: The moment a CE node is disassociated from the Virtual Site — by the removal of its label— the CE node will not be used by RE to connect to origin pools anymore. At the same time, all Load Balancer listeners associated with that Virtual Site are withdrawn from the node. This effectively halts traffic processing for those applications, preventing the node from receiving related traffic. During controlled failover scenarios, we have observed seamless service continuity on externally advertised services (to REs). On-Demand Scaling: F5 Distributed Cloud provides a flexible solution that enables customers to scale the number of active CE nodes according to demand. This allows you to easily add more powerful CE nodes during peak periods (such as promotional events) and then remove them when demand subsides. With the Virtual Sites method, you can even mix and match node sizes within your cluster (Virtual Site), providing granular control over resources. It's advisable to monitor CE node performance and implement node related alerts. These alerts notify you when nodes are operating at high capacity, allowing for timely addition of extra nodes as needed. Moreover, you can monitor node’s health in the dashboard. CPU, Memory and Disk utilizations of nodes can be a good factor in determining if more nodes are needed or not. Furthermore, the use of Virtual Sites makes managing this process even easier, thanks to labels. Node Based Alerts: Node-based alerts are essential for maintaining efficient CE operations. Accessing the alerts in the Console: To view alerts, go to Multi-Cloud Network Connect > Notifications > Alerts. Here, you can see both "Active Alerts" and "All Alerts." Alerts related to node health fall under the "infrastructure" alert group. The following screenshot shows alerts indicating high loads on the nodes. Configuring Alert Policies: Alert policies determine the notification process for raised alerts. To set up an alert policy, navigate to Multi-Cloud Network Connect > Alerts Management > Alert Policies. An alert policy consists of two main elements: the alert receiver configuration and the policy rules. Configuring Alert Receiver: The configuration allows for integration with platforms like Slack and PagerDuty, among others, facilitating notifications through commonly used channels. Configuring Alert Rules: For alert selection, we recommend configuring notifications for alerts with severity of “Major” or “Critical” at a minimum. Alternatively, the “infrastructure” group which includes node-based alerts can be selected. Comparison Table Criteria Default Cluster Single Node HA Minimum number of nodes in HA 3 2 Upgrade operations Per cluster Per Node Network redundancy and client side routing for east-west traffic VRRP, BGP, DNS, L4/7 LB DNS, L4/7 LB, BGP* Tunnels to RE 2 tunnels per cluster 2 tunnels per node Tunnels to other CEs (SMG or DCG) 1 tunnel from each cluster 1 tunnel from each node External traffic processing Limited to 2 nodes All nodes will be active Internal traffic processing All nodes can be active All nodes can be active Scale management in Public Cloud Sites Straightforward, by configuring ingress interfaces in Azure/AWS/GCP sites Straightforward, by adding or removing the labels Scale management in Secure Mesh Sites Requires reconfiguring the cluster (secure mesh site) - may cause interruption Straightforward, by adding or removing the labels Custom VIP IP Available Not Available (Planned to be available in future releases), workaround available. Node sizes All nodes should be same size. Upgrading node size in a cluster is a disruptive operation. Any node sizes or clusters can join the Virtual Site * When using BGP, please note a current limitation that prevents configuring custom VIP address on the Virtual Site. Conclusion: F5 Distributed Cloud offers a flexible approach to High Availability (HA) across CE nodes, allowing customers to select the redundancy model that best fits their specific use cases and requirements. While we continue to advocate for default clustering approach due to their operational simplicity and shared VRRP VIP or, unified network configuration benefits, especially for routine tasks like upgrades, the Virtual Site and single node HA model presents some great use cases. It not only addresses the limitations and challenges of the default clustering model, but also introduces a solution that is both scalable and adaptable. While Virtual Sites offer their own benefits, we recognize they also present trade-offs. The overall benefits, particularly for scenarios demanding high ingress (RE to CE) throughput and controlled failover capabilities cater to specific customer demands. The F5 product and development team remains committed to addressing the limitations of both default clustering and Virtual Sites discussed throughout this article. Their focus is on continuous improvement and finding the solutions that best serve our customers' needs. References and Additional Links: Default Clustering model: https://docs.cloud.f5.com/docs/ves-concepts/site#cluster-of-nodes Configuration guide for Virtual Sites: https://docs.cloud.f5.com/docs/how-to/fleets-vsites/create-virtual-site Routing Options for CEs: https://community.f5.com/kb/technicalarticles/f5-distributed-cloud---customer-edge-site---deployment--routing-options/319435 Configuration guide for DC Clustering Group: https://docs.cloud.f5.com/docs/how-to/advanced-networking/configure-dc-cluster-group4KViews8likes2Comments