chaos engineering
1 TopicTaming your “Chaos Monkey” with F5 Distributed Cloud Platform
Overview Recently, my family returned from a holiday trip to Japan. While the holiday itself was amazing, this article isn't about the experiences or the chaos my children caused; rather, it's about the significant role technology and applications played in enhancing our vacation and our lives in the digital world. Please also notes that “Application” in this context loosely use to refer to software/applications/AI apps/API or systems that power the digital world. Throughout our journey, we found ourselves heavily reliant on various applications, ranging from weather forecasts to navigation aids. We utilized weather apps to stay informed and dressed appropriately, GPS apps to navigate bustling cities and public transportation, and mobile payment apps for seamless transactions. Social media platforms allowed us to update family and friends on our whereabouts, while continuous access to mobile internet (via 4/5G connectivity) kept us tethered to the digital world. Additionally, we interacted with numerous indirect applications and systems, such as ordering food in cafes, different ticketing systems, or using Automated Teller Machines (ATM) for cash withdrawals. Reflecting on these travel experiences prompts consideration of the potential implications had these apps not existed or malfunctioned during our visit. While it might not have been catastrophic, it would have certainly detracted from the smoothness and enjoyment of our holiday. For instance, the failure of my mobile payment app could have hindered transactions, or what if a life-threatening event occurred and the network went down, accessing emergency services would have been impossible—a potentially catastrophic situation that I couldn’t had imagine. The crux of the matter is the paramount importance of ensuring that these applications remain always available, secure, and resilient. They have become integral to modern life, not just enhancing convenience but also playing a crucial role in safety and well-being. Therefore, efforts to maintain their reliability and functionality are imperative in navigating our increasingly digital world. In our increasingly interconnected world, reliance on technology already ubiquitous. The resilience of apps and systems is now a paramount concern for any organization, occupying the top priority in the minds of many executives (CxOs). When these systems fail, causing disruptions for customers or citizens, CxOs may find themselves compelled to respond publicly or even testify before various authorities, demonstrating their due diligence in managing and maintaining these critical assets. Hence, organization need to have strategies to assess and analyse failure mode and impact analysis of those critical application failure. Numerous methodical strategies exist to study and ensure the resilience of apps and systems, such as Failure Modes, Effects, and Criticality Analysis (FMECA), Failure Mode and Effects Analysis (FMEA), and Chaos Engineering. While the intricacies of these methodologies won't be covered in depth here, it's important to introduce them and highlight their shared objective: mitigating business/availability risks to prevent harm to business when apps or systems encounter failures. The focus of this article is to demonstrate how F5's Distributed Cloud (F5XC) Secure Multi-Cloud Networking (MCN) for Kubernetes can address some of these failure scenarios, particularly through the lens of Chaos Engineering. Chaos Engineering involves deliberately inducing failures in a controlled environment to test system resilience. In this demonstration, I’ll leverage the Open Source Chaos Engineering platform to simulate failure scenarios within a running production system. I will use a sample financial application, Arcadia Finance, as our subject for chaos testing. This application consists of microservices distributed across heterogeneous Kubernetes environments, including Amazon EKS, Azure AKS, Google GKE, and Red Hat OpenShift Container Platform (OCP). F5's XC Mesh for Kubernetes can run on any of these Kubernetes platforms and itself formed a secure mesh fabric to orchestrate apps connectivity, delivery, security and observability between those heterogenous container platform. Regardless of the specific strategy employed, the goal remains consistent: implementing risk prevention strategies to safeguard against the potential harm to business caused by app or system failures. Please do note that full end-to-end demo video at the end of this article. Below are some of the mentioned methodologies. Please refer to respective literature for details. FMECA / FMEA From “Find failure and fix it” to “anticipate failure and prevent it” Extracted from (https://www.getmaintainx.com/learning-center/what-is-fmeca-failure-mode-effects-and-critical-analysis/) FMECA is a risk assessment methodology in which you determine failure modes, assess their level of risk to your equipment or system, and rate the failure based on that level of risk. The U.S. military invented this FMECA analysis technique in the ‘40s. The military continues to use the FMECA even today under the MIL STD-1629A. FMECA is a commonly used technique for performing failure detection and criticality analysis on systems to improve their performance. In addition, it typically provides input for Maintainability Analysis and Logistics Support Analysis, both of which rely on FMECA data. With Industry 4.0, many industries are adopting a predictive maintenance strategy for their equipment. To prioritize failure modes and identify mechanical system and subsystem issues for predictive maintenance, FMECA is a widely used tool. Chaos Engineering Excerpt from https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice Chaos Engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news. Chaos Engineering lets you compare what you think will happen to what actually happens in your systems. You literally “break things on purpose” to learn how to build more resilient systems Note: Chaos Monkey serves as a critical tool in enhancing chaos engineering; it enables engineering teams to simulate failures across multiple configurations and monitor the system's behaviour in real time. It was a set of tools that originally open source by Netflix. In this demo, Open Source Litmus will be use instead of Chaos Monkey. Litmus Chaos Platform Litmus is an open source Chaos Engineering platform that enables teams to identify weaknesses & potential outages in infrastructures by inducing chaos tests in a controlled way. It is a Cloud-Native Chaos Engineering Framework with cross-cloud support. It is a CNCF Incubating project with adoption across several organizations. Its mission is to help Kubernetes SREs and Developers to find weaknesses in both Non-Kubernetes as well as platforms and applications running on Kubernetes by providing a complete Chaos Engineering framework and associated Chaos Experiments. Litmus adopts a "Kubernetes-native" approach to define chaos intent in a declarative manner via Kubernetes custom resources (CRs). Litmus platform consist of Control Plane, Execution Plane and Chaos Fault flow. Please refer to official documentation for details - https://docs.litmuschaos.io/docs/introduction/what-is-litmus Chaos Center (Chaos Control Plane) is deployed on F5XC AppStack and Chaos Execution Plane (Litmus Agents/Infrastructure) installed on respective Kubernetes Platform. Litmus agents communicate with Chaos Center via F5XC Secure Mesh Fabric. This is a Traffic Graph where Litmus agent on respective K8S communicating to Chaos Centre over websocket connection. These private connections are secured and protected by F5XC. Chaos Engineering Demo - High Level Demo Architecture In this demo environment, Litmus agents are deployed on both Amazon EKS and Red Hat OCP. Arcadia Finance, comprising multiple microservices (applications and APIs), are distributed across heterogeneous container platforms. The demo will focus on two specific use cases: Use Case #1: Frontend Application Latency Demonstrating network latency impacting frontend applications (EKS), resulting in unresponsive app behavior within critical timeframes. Use Case #2: Production Deployment Issues Showcasing the deployment of an updated version of the money-transfer API container (OCP) leading to the money-transfer API pods entering a CrashLoopBack state, hindering production functionality. Litmus (open source) is capable to inject more than 50 chaos experiment – for example, on Kubernetes; pods kill, pod delete, network latency, pod network disruption, node failure and many more. Please refer to litmus documentation for the complete list of chaos. F5 Distributed Cloud Platform Customer Edge Sites Arcadia Finance Sample Application Construct Litmus Chaos Environment for this Demo Litmus agents installed, registered and connected onto Litmus Chaos Center via F5XC Mesh Fabric. Litmus Chaos Center installed on F5's AppStack Kubernetes. Chaos Experiment created for arcadia frontend 4s network latency injected into arcadia frontend Continuous probe (health check) of the frontend to ensure application still functioning and accessible Without multi-cluster resiliency Injected chaos network latency Running Chaos Experiment workflow Logs shown on F5XC before adding multi cluster resiliency. As shown after 4s latency injected to frontend (served from foobz-mesh-eks1), user/probe unable to get to frontend and subsequent request return 503 error as no available site to handle the introduce network latency. End user will received 503 error – “application down” Chaos Experiments test completed with Failure and Resilience Score to 0%. End Result - Application unavailable. Resilient Score - 0% With multi-cluster resiliency Introduce Google Cloud GKE as part of the backup origin pool for arcadia frontend via CI/CD Pipeline. In the even if frontend on EKS unable to handle or failed, traffic will be steered/redirected to GKE. Similar Chaos experience will be run and completed successful with Resilience Score of 100%. From XC request logs shown traffic seamlessly transition from "foobz-mesh-eks1" to "foobz-mesh-gke1" End Result - Application Always Available. Resilient Score - 100% Similar backup site ("foobz-mesh-aks1") will be added via CI/CD Pipeline into the money-transfer apps/api to provide high redundancy. Deployment of rogue software onto money-transfer api pods that causes money-transfer pod into a CrashLoopBack. From XC logs, you can see that money-transfer served from foobz-ves-ocp-sg transition to foobz-mesh-aks1 seamlessly While refer-friend module still remain in foobz-ves-ocp-sg as refer-friend apps/api are healthy in foobz-ves-ocp-sg End-Result with F5 Distributed Cloud Mesh Demo Video Summary F5 is delivering on its mission to make it significantly easier to secure, deliver, and optimize any app, any API, anywhere. We strive to bring a better digital world to life. Our teams empower organization across the globe to create, secure, and run applications that enhance how we experience our evolving digital world.271Views2likes2Comments