Adopting SRE practices with F5: Multi-cluster Blue-green deployment
In the last article, we covered blue-green deployment as the most straightforward SRE deployment model at a high level. Here we dive deeper into the details to see how F5 technologies enable this use case. Let’s start by looking at some of the key components.

F5 DNS Load Balancer Cloud Service (GSLB)

The first component of the solution is F5 Cloud Services. The DNS Load Balancer provides GSLB as a cloud-hosted SaaS service with built-in DDoS protection and an API-first approach. A blue-green deployment aims to minimize downtime due to app deployment, and OpenShift offers some basic routing mechanisms out of the box that assist in this area. However, if we are looking for a swift routing switch with more flexibility and reliability across different OpenShift clusters, different clouds, or geographic locations, this is where F5 DNS Load Balancer Cloud Service comes into the picture.

Setting up DNS for F5 Cloud Services

This solution requires that your corporate DNS server delegate a DNS zone (aka subdomain) to the F5 DNS Load Balancer Cloud Service. An OpenShift cluster typically has its own domain created for the applications, for example: *.apps.<cluster name>.example.com. The end user, however, doesn't really use such a long name and instead queries for www.example.com. A CNAME record is often used to map one domain name (an alias) to another (the true domain name).
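As a rough illustration of the alias mapping described above, the following Python sketch resolves a name by following CNAME records until it reaches an address record. The record table, the cluster name cluster1, the host name myapp, and the address 192.0.2.10 (taken from the documentation address range) are assumptions made for the example. This is a toy lookup, not how a real DNS resolver or F5 DNS Load Balancer Cloud Service is implemented.

```python
# Toy record table; every name and address here is hypothetical.
RECORDS = {
    "www.example.com": ("CNAME", "myapp.apps.cluster1.example.com"),
    "myapp.apps.cluster1.example.com": ("A", "192.0.2.10"),
}


def resolve(name: str, records: dict, max_hops: int = 8) -> str:
    """Follow CNAME aliases until an A record is found and return its address."""
    current = name
    for _ in range(max_hops):
        if current not in records:
            raise KeyError(f"no record for {current}")
        rtype, value = records[current]
        if rtype == "A":
            return value
        if rtype == "CNAME":
            current = value  # the alias points at another name; keep following
            continue
        raise ValueError(f"unsupported record type {rtype}")
    raise RuntimeError("too many CNAME hops (possible loop)")


if __name__ == "__main__":
    print(resolve("www.example.com", RECORDS))
```

The hop limit is there only so that a misconfigured alias loop fails loudly instead of spinning forever.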
All set up, this is the DNS scenario:

If the customer has more than one cluster, this approach requires one CNAME record per cluster, with requests load balanced among clusters. The drawbacks of this type of solution include:

- No comprehensive health checking and monitoring
- Unable to switch workloads across clusters at speed
- Lack of automation and integration with the OpenShift cluster

F5 Cloud Services provides these features in a multi-cluster and multi-cloud infrastructure around the globe with the ease of a SaaS solution, without the need for infrastructure modifications. You will set up your corporate DNS to use F5 DNS Load Balancer Cloud Service as follows:

Here is a sample configuration for a Cloud/Corporate DNS:

You can register for an F5 Cloud Services account, and then subscribe to the DNS Load Balancer service here: F5 Cloud Services

F5 GSLB tool for Ansible Automation

The blue-green deployment represents a sequence of steps to roll out your new application. The GSLB tool was developed to provide a common automation plane for both OpenShift and F5 Cloud Services. Leveraging the declarative APIs of F5 DNS Load Balancer Cloud Service and OpenShift, we used Ansible to automate the process. It enables you to standardize and automate release and deployment by orchestrating the build, test, provisioning, configuration management, and deployment tools in your Continuous Delivery pipeline. More specifically, the GSLB tool automates your interaction with:

The OpenShift/K8s deployments
- Retrieve Layer 7 routes from a given project/namespace and OpenShift cluster
- Copy Layer 7 routes of a given project/namespace from one OpenShift cluster to another

F5 DNS Load Balancer Service
- Create GSLB Load Balanced Records (LBRs) along with the needed pieces (monitors, IP endpoints, pools, etc.)
- Set the GSLB ratio for each deployment for a given project/namespace

The benefits of using the GSLB tool to automate the entire process:

- Improve speed and scale, especially with hundreds of OpenShift routes
- Eliminate room for human error
- Achieve deterministic and repeatable outcomes

I want to give credit to my colleague, Ulises Alonso Camaro, who developed the GSLB tool. Please refer to the GitHub repository for details of the GSLB tool, and to the wiki for how to set up and operate it.

Build and Run the Blue-green Deployment

Now we can look at how to use F5 DNS Load Balancer Service and the GSLB tool to canary test the new version and manipulate the traffic routing for a blue-green deployment. In a blue-green deployment, we run two versions of the application simultaneously in two identical production environments called Blue (OpenShift Cluster 1) and Green (the new OpenShift Cluster 2).

Step 1. Retrieve routes from Blue cluster and push to F5 DNS Load Balancer Cloud Service

Once you have installed the GSLB tool and configured the deployment settings for your infrastructure, the first set of commands to run is:

    ./project-retrieve default aws1 && ./gslb-commit "publish routes from Blue cluster to F5 DNS load balancer"

These commands retrieve the OpenShift route(s) from your Blue cluster aws1, and then publish the retrieved routes to F5 DNS Load Balancer Cloud Service.

Step 2. Retrieve routes from Green cluster and push to F5 DNS Load Balancer Cloud Service

Run the following commands:

    ./project-retrieve default aws2 && ./gslb-commit "publish routes from Green cluster to F5 DNS load balancer"

These commands retrieve the OpenShift route(s) from your Green cluster aws2, and push them to the F5 DNS Load Balancer Cloud Service configuration.

Step 3. Canary test green deployment

Run the following commands:

    ./project-ratios default '{"aws1": "90", "aws2": "10"}' && ./gslb-commit "canary testing blue and green clusters"

These commands set the traffic ratio for the Blue (90%) and the Green (10%) deployments and publish the configuration. F5 DNS Load Balancer Cloud Service then sets the traffic ratio for each endpoint accordingly.

Step 4. Switch traffic to Green

After the testing succeeds, it is time to switch production traffic to the Green cluster. Run the following commands:

    ./project-evacuate default aws1 && ./gslb-commit "switch all traffic to green cluster"

These commands switch the traffic completely from the Blue to the Green deployment.

More Architectural Patterns

There are many patterns related to blue-green deployment, each of which offers a different focus for an automated production deployment. Some example variants include:

- Infrastructure as Code (IaC): In this variant of the pattern, the release deployment target environment does not exist until it is created by the DevOps pipeline. Post deployment, the original ‘blue’ environment is scheduled for destruction once the ‘green’ environment is considered stable in production.
- Container-based Deployment: In this variant of the pattern, the release deployment target is represented as a collection of one or more containers. Post release, once the ‘green’ environment is considered stable in production, the containers represented by the ‘blue’ container group are scheduled for destruction.

Our solution can address all blue-green deployment variants: the resources used in the blue and green environments can be created or destroyed as needed, or they can be geographically distributed.
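The ratio-based steering used in Steps 3 and 4 can be pictured with a small simulation. The Python sketch below picks a cluster for each DNS answer in proportion to its ratio. Treating evacuation as a ratio of 0 for the Blue cluster is an assumption made for illustration, and pick_cluster and simulate are hypothetical helpers, not part of the GSLB tool or of F5 DNS Load Balancer Cloud Service.

```python
import random
from collections import Counter


def pick_cluster(ratios: dict, rng: random.Random) -> str:
    """Pick one cluster name with probability proportional to its ratio."""
    names = list(ratios)
    weights = [float(ratios[name]) for name in names]
    if sum(weights) <= 0:
        raise ValueError("at least one cluster needs a positive ratio")
    return rng.choices(names, weights=weights, k=1)[0]


def simulate(ratios: dict, queries: int, seed: int = 0) -> Counter:
    """Count how many simulated DNS answers each cluster would receive."""
    rng = random.Random(seed)
    return Counter(pick_cluster(ratios, rng) for _ in range(queries))


if __name__ == "__main__":
    # Same shape as the ratios JSON in Step 3 (values given as strings).
    canary = simulate({"aws1": "90", "aws2": "10"}, queries=10_000)
    print("canary:", dict(canary))
    # Assumed equivalent of Step 4: Blue no longer receives any answers.
    switched = simulate({"aws1": "0", "aws2": "100"}, queries=10_000)
    print("switched:", dict(switched))
```

Keep in mind that DNS answers are cached by resolvers for the record's TTL, so the share of real client traffic reaching each cluster only approximates the configured ratio.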
While Continuous Deployment (CD) is a natural fit for blue-green deployment, F5 DNS Load Balancer Cloud Service combined with the GSLB tool can enable many other possibilities and support a collection of architecture patterns, including:

- Migrate applications from a source cluster (OCP 3.x) to a destination cluster (OCP 4.x); refer here for details
- Migrate workloads from a Kubernetes cluster to an OpenShift cluster
- Modernize your application deployment with Lift and Shift: repackage your application running as a set of VMs into containers, and deploy them into an OpenShift or Kubernetes cluster
- Build into a CI/CD pipeline so that any future changes to the application are built and deployed automatically

We are continuously working on more usage patterns and will explore them in more detail in future blog posts.

What’s next?

So, go ahead to the DevCentral GitHub repo, download the source code behind our technologies, and follow the guide to test it out in your own environment.

Adopting Site Reliability Engineering with F5
Foreword

The Site Reliability Engineering (SRE) role is common in cloud-first enterprises and is becoming more widespread in traditional IT teams. With this article we kick off a series that looks at the concepts that give SRE shape, outlines the primary tools and best practices that make it possible, and explores some common use cases around Continuous Deployment (CD) strategy, visibility, and security.

While SRE and DevOps share many areas of commonality, there are significant differences between them. DevOps is a loose set of practices, guidelines, and culture designed to break down silos among Development, IT operations, Network, and Security teams. DevOps does not tell you how to run operations at a detailed level. On the other hand, SRE, a term pioneered by Google, brings an opinionated framework to the problem of how to run operations effectively. If you think of DevOps as a philosophy, you can argue that SRE implements some of the philosophy that DevOps describes. In a way, SRE implements DevOps practices. After all, SRE only works at all if we have tools and technologies to enable it.

Balancing Release Velocity and Reliability

SRE aims to find the balance between feature velocity and reliability, which are often treated as opposing goals. Despite the risk of making changes to software, these changes are necessary for the business to succeed. Instead of advocating against change, SRE uses the concept of Service Level Objectives (SLOs) and error budgets to measure the impact of releases on reliability. The goal is to ship software as quickly as possible while meeting the reliability targets the users expect.

While there is a wide range of ways an SRE-focused IT team might optimize the balance between agility and stability, two deployment models stand out for their widespread applicability and general ease of execution:

Blue-green deployment

For SRE, availability is currently the most common SLO.
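To make the availability SLO and error budget idea concrete, the Python sketch below converts an availability target into an allowed-downtime budget and checks how much of it remains. The 30-day window and the function names are assumptions for illustration; teams define their own windows and ways of measuring downtime.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO tolerates."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1], e.g. 0.999")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Return the unspent error budget in minutes (negative if overspent)."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    left = budget_remaining(0.999, downtime_minutes=12.0)
    print(f"budget: {budget:.1f} min, remaining: {left:.1f} min")
```

For example, a 99.9% availability target over 30 days tolerates about 43.2 minutes of downtime. A release that risks more downtime than the remaining budget is a signal to slow down, which is how the budget ties release velocity to reliability.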
If getting new software to your users with uninterrupted access is truly required, there needs to be engineering work to implement load balancing or fractional release measures like blue-green or canary deployments to minimize any downtime. Recovery is a factor too.

The idea behind blue-green deployment is that your blue environment is your existing production environment carrying live traffic. In parallel, you provision a green environment, which is identical to the blue environment other than the new version of your code. As you prepare a new version of your software, deployment and the final stage of testing take place in the environment that is not live: in this example, Green (the new OpenShift cluster). When it's time to deploy, you route production traffic from the blue environment to the green environment. This technique can eliminate downtime due to app deployment. In addition, blue-green deployment reduces risk: if something unexpected happens with your new version on Green, you can immediately roll back by reverting traffic to the original blue environment.

When you need to manipulate traffic with more flexibility and reliability across different clusters, different clouds, or geographic locations, this is where F5 DNS Load Balancer Cloud Service comes into the picture. F5 Cloud Services GSLB is a SaaS offering. It can provide automatic failover, load balancing across multiple locations, increased reliability by avoiding a single point of failure, and increased performance by directing traffic to the optimal site. This allows SRE to move fast while still maintaining enterprise-grade reliability.

Targeted Canary deployment

Another approach to promoting availability as an SRE SLO is canary deployment. In some cases, swapping out the entire deployment via a blue-green environment may not be desired. In a canary deployment, you upgrade an application on a subset of the infrastructure and allow a limited set of users to access the new version.
This approach allows you to test the new software under a production-like load, evaluate how well it meets users’ needs, and assess whether new features are profitable.

One approach often used by Azure DevOps is the ring deployment model. Users fall into three general buckets based on their respective risk profiles:

Ring 1 - Canaries who voluntarily test bleeding-edge features as soon as they are available.
Ring 2 - Early adopters who voluntarily preview releases, considered more refined than the canary bits.
Ring 3 - Users who consume the products after they have passed through canaries and early adopters.

Developers can promote and target new versions of the same application (versions 1.2, 1.1, and 1.0) to targeted users (rings 1, 2, and 3, respectively) without involving or waiting for the infrastructure operations team (NoOps). To identify the right version for each user, you may choose to simply use the IP address, authenticate directly at the backend, or add an authentication layer in front of the backend. F5 technologies can help enable this targeted canary use case:

- BIG-IP APM in the North-South (N-S) path authenticates and identifies users as ring 1, 2, or 3, and injects the user identification into an HTTP header
- This identification is passed on to the NGINX Plus micro-gateway, which directs users to the correct microservice versions

Combining BIG-IP and NGINX, this architecture uniquely gives SRE the flexibility to adapt, with the ability to define the baseline service control and security (for NetOps or SecOps) while extending controls for more granular and enhanced security to the developer team (for DevOps).

The need for observability

For SRE, monitoring is at the heart of implementing SLOs in practice. You can't understand what you can't see. A classic and common approach to monitoring is to watch for a specific value or condition, and then to trigger an alert when that value is exceeded or that condition occurs. One of the valid monitoring outputs is logging, which is recorded for diagnostic or forensic purposes.
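The watch-and-alert pattern described above can be sketched in a few lines of Python. The metric name, the threshold, and the sample values are made up for the example, and check_threshold is a hypothetical helper rather than the API of any monitoring product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alert:
    metric: str
    value: float
    threshold: float


def check_threshold(metric: str, samples: List[float], threshold: float) -> Optional[Alert]:
    """Return an Alert if the worst sample exceeds the threshold, else None."""
    if not samples:
        return None
    worst = max(samples)
    if worst > threshold:
        return Alert(metric=metric, value=worst, threshold=threshold)
    return None


if __name__ == "__main__":
    # Hypothetical response times in milliseconds.
    alert = check_threshold("response_time_ms", [120.0, 95.0, 640.0], threshold=500.0)
    if alert is not None:
        print(f"ALERT {alert.metric}: {alert.value} > {alert.threshold}")
```

Alerting on the single worst sample is the simplest possible rule; in practice a percentile or a sustained breach over a time window is usually less noisy.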
The ELK stack, a collection of three open source projects, namely Elasticsearch, Logstash, and Kibana, provides IT project stakeholders with multi-system and multi-application log aggregation and analysis. ELK can be used for the analysis and visualization of application metrics through a centralized dashboard.

With general visibility in place, tracking can be enabled to add a level of specificity to what is being observed. Taking advantage of an iRule on BIG-IP, NetOps can generate a UUID and insert it into the HTTP header of every HTTP request arriving at BIG-IP. All traffic access logs containing UUIDs, from BIG-IP and NGINX, are sent to the ELK server for validation of information such as user location, response time, and response time by user location. Through the dashboard, end users can easily correlate North-South traffic (processed by BIG-IP) with East-West traffic (processed by NGINX Plus inside the cluster) for end-to-end performance visibility. In turn, tracking performance metrics opens up the possibility of defining service level objectives (SLOs).

With observability, security is possible

Security incidents will always occur, and hence it's essential to integrate security into observability. What’s most important is giving reliability engineers the tools to identify a security problem, work around it, and fix it as quickly as possible. Using the right set of tools, you can build custom autogenerated dashboards and tooling to expose the generated information to engineers in a way that makes it much easier to sort through everything and determine the root cause of a security problem. These include things like Kibana dashboards, which allow engineers to investigate incidents, apply filters, and quickly pinpoint suspicious traffic and its source.
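Investigations like this depend on being able to join log entries from different tiers. As a rough sketch of the UUID-based correlation described earlier, the Python below matches North-South and East-West log entries by request UUID and derives the time spent at each tier. The field names (uuid, response_ms) and the sample records are assumptions for illustration, not the actual BIG-IP, NGINX, or ELK log schema.

```python
import uuid
from typing import Dict, List


def correlate(ns_logs: List[dict], ew_logs: List[dict]) -> Dict[str, dict]:
    """Join North-South and East-West log entries that share a request UUID."""
    ew_by_id = {entry["uuid"]: entry for entry in ew_logs}
    joined = {}
    for entry in ns_logs:
        match = ew_by_id.get(entry["uuid"])
        if match is None:
            continue  # request never reached the inner tier, or its log is missing
        joined[entry["uuid"]] = {
            "edge_ms": entry["response_ms"],
            "service_ms": match["response_ms"],
            # Time spent outside the service itself (network, proxying, etc.)
            "overhead_ms": entry["response_ms"] - match["response_ms"],
        }
    return joined


if __name__ == "__main__":
    request_id = str(uuid.uuid4())  # stands in for the UUID inserted at the edge
    ns = [{"uuid": request_id, "response_ms": 180.0}]
    ew = [{"uuid": request_id, "response_ms": 150.0}]
    print(correlate(ns, ew))
```

In the architecture described here, this join would be done by queries and dashboards in ELK rather than by hand-written code; the sketch only shows why a shared identifier in both logs makes the correlation possible.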
In concert with F5 Advanced WAF and NGINX App Protect, SRE can protect applications against software vulnerabilities and common attacks from both inside and outside microservice clusters. When BIG-IP Advanced WAF or NGINX App Protect detects suspicious traffic, it sends an alert with details to the ELK stack, which indexes and processes the data and then executes the pre-defined Ansible playbook to enforce a security policy in Kubernetes or NGINX App Protect for immediate remediation. SRE not only identifies but also rectifies anomalies by enacting security policy enforcement along the data path. Detect once and protect everywhere.

What’s next?

This is the introduction and first article of this SRE series. In the coming articles, we will dive deep into each of the use cases to showcase the technical details of how we are leveraging F5 technologies and capabilities to help SRE bring together DevOps, NetOps, and SecOps to develop the safeguards and implement the best practices.

To learn more about developing a business case for SRE in your organization, please reach out to F5 Business Development. For technical details and additional information, see this DevCentral GitHub repo.