How to Prepare Your Network Infrastructure to Add HPC Clusters for AI to Your Data Center
HPC AI clusters are getting deployed as highly-engineered 'lego blocks' which are opaque to established data center operations and standards. By taking advantage of established Kubernetes based networking solutions that provide high-speed intelligent networking, you can save yourself from expensive cost overruns, data center re-auditing, and delays. By using Kubernetes based solutions which take advantage of the high-speed networking solutions already required by HP AI deployments, you further optimize your investment in AI.140Views4likes0CommentsAnnouncing F5 NGINX Gateway Fabric 1.4.0 with IPv6 and TLS Passthrough
We announced the next release of F5 NGINX Gateway Fabric version 1.4.0 which includes a lot of smaller but very necessary features. This allows us to dedicate more time to advancing our non-functional testing framework and ensuring we maintain top performance across releases. Nevertheless, we have some great highlights of this release: IPv6 support TLS passthrough (via TLSRoute) Server zone metrics Ability to add custom pod annotations Plenty of bug fixes! During this release cycle, we discovered a bug around our custom policies that occurred when you had the same path for more than one Route: The policy would not be applied to either Route. For this release, we’ve decided to enforce a restriction so that policies cannot be applied when two or more routes share the same path. However, we are pursuing a long-term solution to lift this restriction on this edge case, as we understand that use cases that route based on header, query parameter, or other request attributes on the same path do exist. IPv6 Support While most Kubernetes clusters are still utilizing IPv4, we recognized that anyone employing a IPv6 cluster would have no ability to deploy NGINX Gateway Fabric. Thus, we implemented a simple feature to dual IPv4/IPv6 networking for NGINX Gateway Fabric. This option is enabled by default, so you can simply install as normal on an IPv6 cluster. TLS Passthrough New with 1.4 is TLSRoute support. This Route type enables the TLS Passthrough use case and is similar to setting up an HTTPRoute. This allows you to pass encrypted traffic through NGINX Gateway Fabric where it is terminated by your backend application, ensuring end-to-end encryption. As most information passes through NGINX Gateway Fabric with this route, setup is easy. You can enable TLS passthrough for any application using our guide available here. Non-Functional Testing This release marks the completion of automating our non-functional testing that we execute before each release. If you are unfamiliar with these tests, our team runs NGINX Gateway Fabric through a series of scenarios, non-functional tests, to test if our performance is regressing or improving from previous releases. As an infrastructure product that you rely on, it is our top priority to ensure that stability and performance are not compromised as new features are released. The results of all non-functional testing are available in the GitHub repository for anyone to see and should give you an idea of how well NGINX Gateway Fabric performs in general and across releases. What’s Next NGINX Gateway Fabric 1.5.0 will bring NGINX code snippets to the Gateway API with a first-class Upstream Settings policy to configure keepalive connections and NGINX zone size. If you are familiar with NGINX or find that you need to use a feature that NGINX provides that is not yet available via a Gateway API extension, you can put a NGINX code snippet within a SnippetFilter to apply NGINX configuration to a Route rule. You will even be able to use the feature to load other modules NGINX provides and leverage the vast wealth of NGINX functionality. We will still be providing many NGINX features via first-class policies and filters, such as the Upstream Settings policy, as they allow us to handle much of the complexity of translating to Gateway API for you. These custom policies and filters allow us to handle a lot of the complexity of applying NGINX config across the Gateway API framework for you. The Upstream Settings policy can set upstream management directives that are unable to be applied via snippets effectively. We will continue to deliver these custom policies and filters across all of our releases, in addition to new Gateway API resources and NGINX Gateway Fabric specific features. You can see a preview of the full snippet design here, though not all features may be implemented in one release cycle. For more information on our strategy towards first-class NGINX customization via Gateway API extensions, see our full enhancement proposalhere. Resources For the complete changelog for NGINX Gateway Fabric 1.4.0, see the Release Notes. To try NGINX Gateway Fabric for Kubernetes with NGINX Plus, start your free 30-day trial today or contact us to discuss your use cases. If you would like to get involved, see what is coming next, or see the source code for NGINX Gateway Fabric, check out our repository on GitHub! We have weekly community meetings on Tuesdays at 9:30AM Pacific/12:30PM Eastern/5:30PM GMT. The meeting link, updates, agenda, and notes are on the NGINX Gateway Fabric Meeting Calendar. Links are also always available from our GitHub readme.86Views1like0CommentsAnnouncing F5 NGINX Gateway Fabric 1.3.0 with Tracing, GRPCRoute, and Client Settings
The release of NGINX Gateway Fabric version 1.3.0, introduces plenty of highly requested features and improvements. GRPCRoutes are now supported to manage gRPC traffic, similar to the handling of HTTPRoute. The update includes new custom policies like ClientSettingsPolicy for client request configurations and ObservabilityPolicy for enabling application tracing with OpenTelemetry support. The GRPCRoute allows for efficient routing, header modifications, traffic weighting, and error conversion from HTTP to gRPC. We will explain how to set up NGINX Gateway Fabric to manage gRPC traffic using a Gateway and a GRPCRoute, providing a detailed example of the setup. It also outlines how to enable tracing through the NginxProxy resource and ObservabilityPolicy, emphasizing a selective approach to tracing to avoid data overload. Additionally, the ClientSettingsPolicy allows for the customization of NGINX directives at the Gateway or Route level, giving users control over certain NGINX behaviors with the possibility of overriding Gateway defaults at the Route level. Looking ahead, the NGINX Gateway Fabric team plans to work on TLS Passthrough, IPv6, and improvements to the testing suite, while preparing for larger updates like NGINX directive customization and separation of data and control planes. Check the end of the article to see how to get involved in the development process through GitHub and participate in bi-weekly community meetings. Further resources and links are also provided within.182Views0likes0CommentsSecuring and Scaling Hybrid Apps with F5/NGINX (Part 3)
In part 2 of our series, I demonstrated how to configure ZT (Zero Trust) use cases centering around authentication with NGINX Plus in hybrid environments. We deployed NGINX Plus as the external LB to route and authenticate users connecting to my Kubernetes applications. In this article, we explore other areas of the ZT spectrum configurable on the External LB Service, including: Authorization and Access Encryption mTLS Monitoring/Auditing ZT Use case #1: Authorization Many people think that authentication and authorization can be used interchangeably. However, they both mean different things. Authentication involves the process of verifying user identities based on the credentials presented. Even though authenticated users are verified by the system, they do not necessarily have the authority to access protected applications. That is where authorization comes into play. Authorization involves the process of verifying the authority of an identity before granting access to application. Authorization in the context of OIDC authentication involves retrieving claims from user ID tokens and setting conditions to validate whether the user is authorized to enter the system. An authenticated user is granted an ID token from the IdP with specific user information through JWT claims. The configuration of these claims is typically set from the IdP. Revisiting the OIDC auth use case configured in the previous section, we can retrieve the ID tokens of authenticated users from the NGINX key-value store. $ curl -i http://localhost:8010/api/9/http/keyvals/oidc_acess_tokens Then we can view the decoded value of the ID token using jwt.io. Below is an example of decoded payload data from the ID token. { "exp": 1716219261, "iat": 1716219201, "admin": true, "name": "Micash", "zone_info": "America/Los_Angeles" "jti": "9f8ff4bd-4857-4e12-9634-e5876f786f98", "iss": "http://idp.f5lab.com:8080/auth/realms/master", "aud": "account", "typ": "Bearer", "azp": "appworld2024", "nonce": "gMNK3tu06j6tp5-jGa3aRhkj4F0P-Z3e04UfcFeqbes" } NGINX Plus has access to these claims as embedded variables. They are accessed by prefixing $jwt_claim_ to the desired field (for example, $jwt_claim_admin for the admin claim). We can easily set conditions on these claims and block unauthorized users before they even reach the back-end applications. Going back to our frontend.conf file in the previous part of our series. We can set $jwt_flag variable to 0 or 1 based on the value of the admin JWT claim. We then use the jwt_claim_require directive to validate the ID token. ID tokens with admin claims set to false will be rejected. map $jwt_claim_admin $jwt_status { "true" 1; default 0; } server { include conf.d/openid_connect.server_conf; # Authorization code flow and Relying Party processing error_log /var/log/nginx/error.log debug; # Reduce severity level as required listen [::]:443 ssl ipv6only=on; listen 443 ssl; server_name example.work.gd; ssl_certificate /etc/ssl/nginx/default.crt; # self-signed for example only ssl_certificate_key /etc/ssl/nginx/default.key; location / { # This site is protected with OpenID Connect auth_jwt "" token=$session_jwt; error_page 401 = @do_oidc_flow; auth_jwt_key_request /_jwks_uri; # Enable when using URL auth_jwt_require $jwt_status; proxy_pass https://cluster1-https; # The backend site/app } } Note: Authorization with NGINX Plus is not restricted to only JWT tokens. You can technically set conditions on a variety of attributes, such as: Session cookies HTTP headers Source/Destination IP addresses ZT use case #2: Mutual TLS Authentication (mTLS) When it comes to ZT, mTLS is one of the mainstream use cases falling under the Zero Trust umbrella. For example, enterprises are using Service Mesh technologies to stay compliant with ZT standards. This is because Service Mesh technologies aim to secure service to service communication using mTLS. In many ways, mTLS is similar to the OIDC use case we implemented in the previous section. Only here, we are leveraging digital certificates to encrypt and authenticate traffic. This underlying framework is defined by PKI (Public Key Infrastructure). To explain this framework in simple terms we can refer to a simple example; the driver's license you carry in your wallet. Your driver’s license can be used to validate your identity, the same way digital certificates can be used to validate the identity of applications. Similarly, only the state can issue valid driver's licenses, the same way only Certificate Authorities (CAs) can issue valid certificates to applications. It is also important that only the state can issue valid certificates. Therefore, every CA must have a private secure key to sign and issue valid certificates. Configuring mTLS with NGINX can be broken down in two parts: Ingress mTLS; Securing SSL client traffic and validating client certificates against a trusted CA. Egress mTLS; securing SSL upstream traffic and offloading authentication of TLS material to a trusted HTTPS back-end server. Ingress mTLS You can configure ingress mTLS on the NLK deployment by simply referencing the trusted certificate authority adding the ssl_client_certificate directive in the server context. This will configure NGINX to validate client certificates with the referenced CA. Note: If you do not have a CA, you can create one using OpenSSL or Cloudflare PKI and TLS toolkits server { listen 443 ssl; status_zone https://cafe.example.com; server_name cafe.example.com; ssl_certificate /etc/ssl/nginx/default.crt; ssl_certificate_key /etc/ssl/nginx/default.key; ssl_client_certificate /etc/ssl/ca.crt; } Egress mTLS Egress mTLS is a slight alternative to ingress mTLS where NGINX verifies certificates of upstream applications rather than certificates originating from clients. This feature can be enabled by adding the proxy_ssl_trusted_certificate directive to the server context. You can reference the same trusted CA we used for verification when configuring ingress mTLS or reference a different CA. In addition to verifying server certificates, NGINX as a reverse-proxy can pass over certs/keys and offload verification to HTTPS upstream applications. This can be done by adding the proxy_ssl_certificate and proxy_ssl_certificate_key directives in the server context. server { listen 443 ssl; status_zone https://cafe.example.com; server_name cafe.example.com; ssl_certificate /etc/ssl/nginx/default.crt; ssl_certificate_key /etc/ssl/nginx/default.key; #Ingress mTLS ssl_client_certificate /etc/ssl/ca.crt; #Egress mTLS proxy_ssl_certificate /etc/nginx/secrets/default-egress.crt; proxy_ssl_certificate_key /etc/nginx/secrets/default-egress.key; proxy_ssl_trusted_certificate /etc/nginx/secrets/default-egress-ca.crt; } ZT use case #3: Secure Assertion Markup Language (SAML) SAML (Security Assertion Markup Language) is an alternative SSO solution to OIDC. Many organizations may choose between SAML and OIDC depending on requirements and IdPs they currently run in production. SAML requires a SP (Service Provider) to exchange XML messages via HTTP POST binding to a SAML IdP. Once exchanges between the SP and IdP are successful, the user will have session access to the protected backed applications with one set of user credentials. In this section, we will configure NGINX Plus as the SP and enable SAML with the IdP. This will be like how we configured NGINX Plus as the relying party in an OIDC authorization code flow (See ZT Use case #1). Setting up the IdP The one prerequisite is setting up your IdP. In our example, we will set up the Microsoft Entra ID on Azure. You can use the SAML IdP of your choosing. Once the SAML application is created in your IdP, you can access the SSO fields necessary to link your SP (NGINX Plus) to your IdP (Microsoft Entra ID). You will need to edit the basic SAML configuration by clicking on the pencil icon next to Editin Basic SAML Configuration, as seen in the figure above. Add the following values and click Save: Identifier (Entity ID) -- https://fourth.run.place Reply URL (Assertion Consumer Service URL) -- https://fourth.run.place/saml/acs Sign on URL: https://fourth.run.place Logout URL (Optional): https://fourth.run.place/saml/sls Finally download the Certificate (Raw) from Microsoft Entra ID and save it to your NGINX Plus instance. This certificate is used to verify signed SAML assertions received from the IdP. Once the certificate is saved on the NGINX Plus instance, extract the public key from the downloaded certificate and convert it to SPKI format. We will use this certificate later when we configure NGINX Plus in the next section. $ openssl x509 -in demo-nginx.der -outform DER -out demo-nginx.der $ openssl x509 -inform DER -in demo-nginx.der -pubkey -noout > demo-nginx.spki Configuring NGINX Plus as the SAML Service Provider After the IdP is setup, we can configure NGINX Plus as the SP to exchange and validate XML messages with the IdP. Once logged into the NGINX Plus instance, simply clone the nginx SAML GitHub repo. $ git clone https://github.com/nginxinc/nginx-saml.git && cd nginx-saml Copy the config files into the /etc/nginx/conf.d directory. $ cp frontend.conf saml_sp.js saml_sp.server_conf saml_sp_configuration.conf /etc/nginx/conf.d/ Notice that by default, frontend.conf listens on port 8010 with clear text http. You can merge kube_lb.conf into frontend.conf to enable TLS termination and update the upstream context with application endpoints you wish to protect with SAML. Finally we will need to edit the saml_sp_configuration.conf file and update variables in the map context based on the parameters of your SP and IdP: $saml_sp_entity_id; https://fourth.run.place $saml_sp_acs_url; https://fourth.run.place/saml/acs $saml_sp_sign_authn; false $saml_sp_want_signed_response; false $saml_sp_want_signed_assertion; true $saml_sp_want_encrypted_assertion; false $saml_idp_entity_id; Unique identifier that identifies the IdP to the SP. This field is retrieved from your IdP $saml_idp_sso_url; This is the login URL and is also retrieved from the IdP $saml_idp_verification_certificate; Variable referencing the certificate downloaded from the previous section when setting up the IdP. This certificate will verify signed assertions received from the IdP. Use the full directory (/etc/nginx/conf.d/demo-nginx.spki) $saml_sp_slo_url; https://fourth.run.place/saml/sls $saml_idp_slo_url; This is the logout URL retrieved from the IdP $saml_sp_want_signed_slo; true The remaining variables defined in saml_sp_configuration.conf can be left unchanged, unless there is a specific requirement for enabling them. Once the variables are set appropriately, we can reload NGINX Plus. $ nginx -s reload Testing Now we will verify the SAML flow. open your browser and enter https://fourth.run.place in the address bar. This should redirect me to the IDP login page. Once you login with your credentials, I should be granted access to my protected application ZT use case #4: Monitoring/Auditing NGINX logs/metrics can be exported to a variety of 3rd party providers including: Splunk, Prometheus/Grafana, cloud providers (AWS CloudWatch and Azure Monitor Logs), Datadog, ELK stack, and more. You can monitor NGINX metrics and logs natively with NGINX Instance Manager or NGINX SaaS. The NGINX Plus API provides me a lot of flexibility by exporting metrics to any third-party tool that accepts JSON. For example, you can export NGINX Plus API metrics to our native real-time dashboard from part 1. native real-time dashboard from part 1 Whichever tool I chose, monitoring/auditing my data generated from my IT systems is key to understanding and optimizing my applications. Conclusion Cloud providers offer a convenient way to expose Kubernetes Services to the internet. Simply create Kubernetes Service of type: LoadBalancer and external users connect to your services via public entry point. However, cloud load balancers do nothing more than basic TCP/HTTP load balancing. You can configure NGINX Plus with many Zero Trust capabilities as you scale out your environment to multiple clusters in different regions, which is what we will cover in the next part of our series.196Views2likes0CommentsSecuring and Scaling Hybrid Application with F5 NGINX (Part 1)
If you are using Kubernetes in production, then you are likely using an ingress controller. The ingress controller is the core engine managing traffic entering and exiting the Kubernetes cluster. Because the ingress controller is a deployment running inside the cluster, how do you route traffic to the ingress controller? How do you route external traffic to internal Kubernetes Services? Cloud providers offer a simple convenient way to expose Kubernetes Services using an external load balancer. Simply deploy a Managed Kubernetes Service (EKS, GKE, AKS) and create a Kubernetes Service of type LoadBalancer. The cloud providers will host and deploy a load balancer providing a public IP address. External users can connect to Kubernetes Services using this public entry point. However, this integration only applies to Managed Kubernetes Services hosted by cloud providers. If you are deploying Kubernetes in private cloud/on-prem environments, you will need to deploy your own load balancer and integrate it with the Kubernetes cluster. Furthermore, Kubernetes Load Balancing integrations in the cloud are limited to TCP Load Balancing and generally lack visibility into metrics, logs, and traces. We propose: A solution that applies regardless of the underlying infrastructure running your workloads Guidance around sizing to avoid bottlenecks from high traffic volumes Application delivery use cases that go beyond basic TCP/HTTP load balancing In the solution depicted below, I deploy NGINX Plus as the external LB service for Kubernetes and route traffic to the NGINX Ingress Controller. The NGINX Ingress Controller will then route the traffic to the application backends. The NLK (NGINX Load Balancer for Kubernetes) deployment is a new controller by NGINX that monitors specified Kubernetes Services and sends API calls to manage upstream endpoints of the NGINX External Load Balancer In this article, I will deploy the components both inside the Kubernetes cluster and NGINX Plus as the external load balancer. Note: I like to deploy both the NLK and Kubernetes cluster in the same subnet to avoid network issues. This is not a hard requirement. Prerequisites The blog assumes you have experience operating in Kubernetes environments. In addition, you have the following: Access to a Kubernetes environment; Bare Metal, Rancher Kubernetes Engine (RKE), VMWare Tanzu Kubernetes (VTK), Amazon Elastic Kubernetes (EKS), Google Kubernetes Engine (GKE), Microsoft Azure Kubernetes Service (AKS), and RedHat OpenShift NGINX Ingress Controller – Deploy NGINX Ingress Controller in the Kubernetes cluster. Installation instructions can be found in the documentation. NGINX Plus – Deploy NGINX Plus on VM or bare metal with SSH access. This will be the external LB service for the Kubernetes cluster. Installation instructions can be found in the documentation. You must have a valid license for NGINX Plus. You can get started today by requesting a 30-day free trial. Setting up the Kubernetes environment I start with deploying the back-end applications. You can deploy your own applications, or you can deploy our basic café application as an example. $ kubectl apply –f cafe.yaml Now I will configure routes and TLS settings for the ingress controller $ kubectl apply –f cafe-secret.yaml $ kubectl apply –f cafe-virtualserver.yaml To ensure the ingress rules are successfully applied, you can examine the output of kubectl get vs. The VirtualServer definition should be in the Valid state. NAMESPACE NAME STATE HOST IP PORTS default cafe-vs Valid cafe.example.com Setting up NGINX Plus as the external LB A fresh install of NGINX Plus will provide the default.conf file in the /etc/nginx/conf.d directory. We will add two additional files into this directory. Simply copy the bulleted files into your /etc/nginx/conf.d directory dashboard.conf; This will enable the real-time monitoring dashboard for NGINX Plus kube_lb.conf; The nginx configuration as the external load balancer for Kubernetes. You can change the configuration file to fit your requirements. In part 1 of this series, we enabled basic routing and TLS for one cluster. You will also need to generate TLS cert/keys and place them in the /etc/ssl/nginx folder of the NGINX Plus instance. For the sake of this example, we will generate a self-signed certificate with openssl. $ openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout default.key -out default.crt -subj "/CN=NLK" Note: Using self-signed certificates is for testing purposes only. In a live production environment, we recommend using a secure vault that will hold your secrets and trusted CAs (Certificate Authorities). Now I can validate the configuration and reload nginx for the changes to take effect. $ nginx –t $ nginx –s reload I can now connect to the NGINX Plus dashboard by opening a browser and entering http://<external-ip-nginx>:9000/dashboard.html#upstreams The HTTP upstream table should be empty as we have not deployed the NLK Controller yet. We will do that in the next section. Installing the NLK Controller You can install the NLK Controller as a Kubernetes deployment that will configure upstream endpoints for the external load balancer using the NGINX Plus API. First, we will create the NLK namespace $ kubectl create ns nlk And apply the RBAC settings for the NKL deployment $ kubectl apply -f serviceaccount.yaml $ kubectl apply -f clusterrole.yaml $ kubectl apply -f clusterrolebinding.yaml $ kubectl apply -f secret.yaml The next step is to create a ConfigMap defining the API endpoint of the NGINX Plus external load balancer. The API endpoint is used by the NLK Controller to configure the NGINX Plus upstream endpoints. We simply modify the nginx-hosts field in the manifest from our GitHub repository to the IP address of the NGINX external load balancer. nginx-hosts: http://<nginx-plus-external-ip>:9000/api Apply the updated ConfigMap and deploy the NLK controller $ kubectl apply –f nkl-configmap.yaml $ kubectl apply –f nkl-deployment I can verify the NLK controller deployment is running and the ConfigMap data is applied. $ kubectl get pods –o wide –n nlk $ kubectl describe cm nginx-config –n nlk You should see the NLK deployment in status Running and the URL should be defined under nginx-hosts. The URL is the NGINX Plus API endpoint of the external load Balancer. Now that the NKL Controller is successfully deployed, the external load balancer is ready to route traffic to the cluster. The final step is deploying a Kubernetes Service type NodePort to expose the Kubernetes cluster to NGINX Plus. $ kubectl apply –f nodeport.yaml There are a couple things to note about the NodePort Service manifest. Fields on line 7 and 14 are required for the NLK deployment to configure the external load balancer appropriately: The nginxinc.io/nkl-cluster annotation The port name matching the upstream block definition in the NGINX Plus configuration (See line 42 in kube_lb.conf) and preceding nkl- apiVersion: v1 kind: Service metadata: name: nginx-ingress namespace: nginx-ingress annotations: nginxinc.io/nlk-cluster1-https: "http" # Must be added spec: type: NodePort ports: - port: 443 targetPort: 443 protocol: TCP name: nlk-cluster1-https selector: app: nginx-ingress Once the service is applied, you can note down the assigned nodeport selecting the NGINX Ingress Controller deployment. In this example, that node port is 32222. $ kubectl get svc –o wide –n nginx-ingress NAME TYPE CLUSTER-IP PORT(S) SELECTOR nginx-ingress NodePort x.x.x.x 443:32222/TCP app=nginx-ingress If I reconnect to my NGINX Pus dashboard, the upstream tab should be populated with the worker node IPs of the Kubernetes cluster and matching the node port of the nginx-ingress Service (32222). You can list the node IPs of your cluster to make sure they match the IPs in the dashboard upstream tab. $ kubectl get nodes -o wide | awk '{print $6}' INTERNAL-IP 10.224.0.6 10.224.0.5 10.224.0.4 Now I can connect to the Kubernetes application from our local machine. The hostname we used in our example (cafe.example.com) should resolve to the IP address of the NGINX Plus load balancer. Wrapping it up Most enterprises deploying Kubernetes in production will install an ingress controller. It is the DeFacto standard for application delivery in container orchestrators like Kubernetes. DevOps/NetOps engineers are now looking for guidance on how to scale out their Kubernetes footprint in the most efficient way possible. Because enterprises are embracing the hybrid approach, they will need to implement their own integrations outside of cloud providers. The solution we propose: is applicable to hybrid environments (particularly on-prem) Sizing information to avoid bottlenecks from large traffic volumes Enterprise Load Balancing capabilities that stretch beyond a TCP LoadBalancing Service In the next part of our series, I will dive into the third bullet point into much more detail and cover Zero Trust use cases with NGINX Plus, providing extra later of security in your hybrid model.486Views0likes0CommentsF5 BIG-IP per application Red Hat OpenShift cluster migrations
Overview OpenShift migrations are typically done when it is desired to minimise disruption time when performing cluster upgrades. Disruptions can especially occur when performing big changes in the cluster such as changing the CNI from OpenShiftSDN to OVNKubernetes. OpenShift cluster migrations are well covered for applications by using RedHat's Migration Toolkit for Containers (MTC). The F5 BIG-IP has the role of network redirector indicated in the Network considerations chapter. The F5 BIG-IP can perform per L7 route migration without service disruption, hence allowing migration or roll-back on a per-application basis, eliminating disruption and de-risking the maintenance window. How it works As mentioned above, the traffic redirection will be done on a per L7 route basis, this is true regardless of how these L7 routes are implemented: ingress controller, API manager, service mesh, or a combination of these. This L7 awareness is achieved by usingF5 BIG-IP's Controller Ingress Services (CIS) controller for Kubernetes/OpenShift and its multi-cluster functionality which can expose in a single VIP L7 routes of services hosted in multiple Kubernetes/OpenShift clusters. This is shown in the next picture. For a migration operation it will be used a blue/green strategy independent for each L7 route where blue will refer to the application in the older cluster and green will refer to the application in the newer cluster. For each L7 route, it will be specified a weight for each blue or green backend (like in an A/B strategy). This is shown in the next picture. In this example, the migration scenario uses OpenShift´s default ingress controller (HA proxy) as an in-cluster ingress controller where the Route CR is used to indicate the L7 routes. For each L7 route defined in the HA-proxy tier, it will be defined as an L7 route in the F5 BIG-IP tier. This 1:1 mapping allows to have the per-application granularity. The VirtualServer CR is used for the F5 BIG-IP. If desired, it is also possible to use Route resources for the F5 BIG-IP. Next, it is shown the manifests for a given L7 route required for the F5 BIG-IP, in this case, https://www.migration.com/shop (alias route-b) apiVersion: "cis.f5.com/v1" kind: VirtualServer metadata: name: route-b namespace: openshift-ingress labels: f5cr: "true" spec: host: www.migration.com virtualServerAddress: "10.1.10.106" hostGroup: migration.com tlsProfileName: reencrypt-tls profileMultiplex: "/Common/oneconnect-32" pools: - path: /shop service: router-default-route-b-ocp1 servicePort: 443 weight: 100 alternateBackends: - service: router-default-route-b-ocp2 weight: 0 monitor: type: https name: /Common/www.migration.com-shop reference: bigip --- apiVersion: v1 kind: Service metadata: annotations: name: router-default-route-b-ocp1 namespace: openshift-ingress spec: ports: - name: http port: 80 protocol: TCP targetPort: http - name: https port: 443 protocol: TCP targetPort: https selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default type: NodePort --- apiVersion: v1 kind: Service metadata: annotations: name: router-default-route-b-ocp2 namespace: openshift-ingress spec: ports: - name: http port: 80 protocol: TCP targetPort: http - name: https port: 443 protocol: TCP targetPort: https selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default type: NodePort The CIS multi-cluster feature will search for the specified services in both clusters. It is up to DevOps to ensure that blue services are only present in the cluster designated as blue (in this case router-default-route-b-ocp1) and green services are only present in the cluster designated as green (in this case router-default-route-b-ocp2). It is important to remark that the Route manifests for HA-proxy (or any other ingress solution used) doesn't require any modification. That is, this migration mechanism is transparent to the application developers. Demo You can see this feature in action in the next video The manifests used in the demo can be found in the following GitHub repository: https://github.com/f5devcentral/f5-bd-cis-demo/tree/main/crds/demo-mc-twotier-haproxy-noshards Next steps Try it today! CIS is open-source software and is included in your support entitlement. If you want to learn more about CIS and CIS multi-cluster features the following blog articles are suggested. F5 BIG-IP deployment with OpenShift - platform and networking options F5 BIG-IP deployment with OpenShift - publishing application options F5 BIG-IP deployment with OpenShift - multi-cluster architectures423Views0likes0CommentsTaming your “Chaos Monkey” with F5 Distributed Cloud Platform
Overview Recently, my family returned from a holiday trip to Japan. While the holiday itself was amazing, this article isn't about the experiences or the chaos my children caused; rather, it's about the significant role technology and applications played in enhancing our vacation and our lives in the digital world. Please also notes that “Application” in this context loosely use to refer to software/applications/AI apps/API or systems that power the digital world. Throughout our journey, we found ourselves heavily reliant on various applications, ranging from weather forecasts to navigation aids. We utilized weather apps to stay informed and dressed appropriately, GPS apps to navigate bustling cities and public transportation, and mobile payment apps for seamless transactions. Social media platforms allowed us to update family and friends on our whereabouts, while continuous access to mobile internet (via 4/5G connectivity) kept us tethered to the digital world. Additionally, we interacted with numerous indirect applications and systems, such as ordering food in cafes, different ticketing systems, or using Automated Teller Machines (ATM) for cash withdrawals. Reflecting on these travel experiences prompts consideration of the potential implications had these apps not existed or malfunctioned during our visit. While it might not have been catastrophic, it would have certainly detracted from the smoothness and enjoyment of our holiday. For instance, the failure of my mobile payment app could have hindered transactions, or what if a life-threatening event occurred and the network went down, accessing emergency services would have been impossible—a potentially catastrophic situation that I couldn’t had imagine. The crux of the matter is the paramount importance of ensuring that these applications remain always available, secure, and resilient. They have become integral to modern life, not just enhancing convenience but also playing a crucial role in safety and well-being. Therefore, efforts to maintain their reliability and functionality are imperative in navigating our increasingly digital world. In our increasingly interconnected world, reliance on technology already ubiquitous. The resilience of apps and systems is now a paramount concern for any organization, occupying the top priority in the minds of many executives (CxOs). When these systems fail, causing disruptions for customers or citizens, CxOs may find themselves compelled to respond publicly or even testify before various authorities, demonstrating their due diligence in managing and maintaining these critical assets. Hence, organization need to have strategies to assess and analyse failure mode and impact analysis of those critical application failure. Numerous methodical strategies exist to study and ensure the resilience of apps and systems, such as Failure Modes, Effects, and Criticality Analysis (FMECA), Failure Mode and Effects Analysis (FMEA), and Chaos Engineering. While the intricacies of these methodologies won't be covered in depth here, it's important to introduce them and highlight their shared objective: mitigating business/availability risks to prevent harm to business when apps or systems encounter failures. The focus of this article is to demonstrate how F5's Distributed Cloud (F5XC) Secure Multi-Cloud Networking (MCN) for Kubernetes can address some of these failure scenarios, particularly through the lens of Chaos Engineering. Chaos Engineering involves deliberately inducing failures in a controlled environment to test system resilience. In this demonstration, I’ll leverage the Open Source Chaos Engineering platform to simulate failure scenarios within a running production system. I will use a sample financial application, Arcadia Finance, as our subject for chaos testing. This application consists of microservices distributed across heterogeneous Kubernetes environments, including Amazon EKS, Azure AKS, Google GKE, and Red Hat OpenShift Container Platform (OCP). F5's XC Mesh for Kubernetes can run on any of these Kubernetes platforms and itself formed a secure mesh fabric to orchestrate apps connectivity, delivery, security and observability between those heterogenous container platform. Regardless of the specific strategy employed, the goal remains consistent: implementing risk prevention strategies to safeguard against the potential harm to business caused by app or system failures. Please do note that full end-to-end demo video at the end of this article. Below are some of the mentioned methodologies. Please refer to respective literature for details. FMECA / FMEA From “Find failure and fix it” to “anticipate failure and prevent it” Extracted from (https://www.getmaintainx.com/learning-center/what-is-fmeca-failure-mode-effects-and-critical-analysis/) FMECA is a risk assessment methodology in which you determine failure modes, assess their level of risk to your equipment or system, and rate the failure based on that level of risk. The U.S. military invented this FMECA analysis technique in the ‘40s. The military continues to use the FMECA even today under the MIL STD-1629A. FMECA is a commonly used technique for performing failure detection and criticality analysis on systems to improve their performance. In addition, it typically provides input for Maintainability Analysis and Logistics Support Analysis, both of which rely on FMECA data. With Industry 4.0, many industries are adopting a predictive maintenance strategy for their equipment. To prioritize failure modes and identify mechanical system and subsystem issues for predictive maintenance, FMECA is a widely used tool. Chaos Engineering Excerpt from https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice Chaos Engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news. Chaos Engineering lets you compare what you think will happen to what actually happens in your systems. You literally “break things on purpose” to learn how to build more resilient systems Note: Chaos Monkey serves as a critical tool in enhancing chaos engineering; it enables engineering teams to simulate failures across multiple configurations and monitor the system's behaviour in real time. It was a set of tools that originally open source by Netflix. In this demo, Open Source Litmus will be use instead of Chaos Monkey. Litmus Chaos Platform Litmus is an open source Chaos Engineering platform that enables teams to identify weaknesses & potential outages in infrastructures by inducing chaos tests in a controlled way. It is a Cloud-Native Chaos Engineering Framework with cross-cloud support. It is a CNCF Incubating project with adoption across several organizations. Its mission is to help Kubernetes SREs and Developers to find weaknesses in both Non-Kubernetes as well as platforms and applications running on Kubernetes by providing a complete Chaos Engineering framework and associated Chaos Experiments. Litmus adopts a "Kubernetes-native" approach to define chaos intent in a declarative manner via Kubernetes custom resources (CRs). Litmus platform consist of Control Plane, Execution Plane and Chaos Fault flow. Please refer to official documentation for details - https://docs.litmuschaos.io/docs/introduction/what-is-litmus Chaos Center (Chaos Control Plane) is deployed on F5XC AppStack and Chaos Execution Plane (Litmus Agents/Infrastructure) installed on respective Kubernetes Platform. Litmus agents communicate with Chaos Center via F5XC Secure Mesh Fabric. This is a Traffic Graph where Litmus agent on respective K8S communicating to Chaos Centre over websocket connection. These private connections are secured and protected by F5XC. Chaos Engineering Demo - High Level Demo Architecture In this demo environment, Litmus agents are deployed on both Amazon EKS and Red Hat OCP. Arcadia Finance, comprising multiple microservices (applications and APIs), are distributed across heterogeneous container platforms. The demo will focus on two specific use cases: Use Case #1: Frontend Application Latency Demonstrating network latency impacting frontend applications (EKS), resulting in unresponsive app behavior within critical timeframes. Use Case #2: Production Deployment Issues Showcasing the deployment of an updated version of the money-transfer API container (OCP) leading to the money-transfer API pods entering a CrashLoopBack state, hindering production functionality. Litmus (open source) is capable to inject more than 50 chaos experiment – for example, on Kubernetes; pods kill, pod delete, network latency, pod network disruption, node failure and many more. Please refer to litmus documentation for the complete list of chaos. F5 Distributed Cloud Platform Customer Edge Sites Arcadia Finance Sample Application Construct Litmus Chaos Environment for this Demo Litmus agents installed, registered and connected onto Litmus Chaos Center via F5XC Mesh Fabric. Litmus Chaos Center installed on F5's AppStack Kubernetes. Chaos Experiment created for arcadia frontend 4s network latency injected into arcadia frontend Continuous probe (health check) of the frontend to ensure application still functioning and accessible Without multi-cluster resiliency Injected chaos network latency Running Chaos Experiment workflow Logs shown on F5XC before adding multi cluster resiliency. As shown after 4s latency injected to frontend (served from foobz-mesh-eks1), user/probe unable to get to frontend and subsequent request return 503 error as no available site to handle the introduce network latency. End user will received 503 error – “application down” Chaos Experiments test completed with Failure and Resilience Score to 0%. End Result - Application unavailable. Resilient Score - 0% With multi-cluster resiliency Introduce Google Cloud GKE as part of the backup origin pool for arcadia frontend via CI/CD Pipeline. In the even if frontend on EKS unable to handle or failed, traffic will be steered/redirected to GKE. Similar Chaos experience will be run and completed successful with Resilience Score of 100%. From XC request logs shown traffic seamlessly transition from "foobz-mesh-eks1" to "foobz-mesh-gke1" End Result - Application Always Available. Resilient Score - 100% Similar backup site ("foobz-mesh-aks1") will be added via CI/CD Pipeline into the money-transfer apps/api to provide high redundancy. Deployment of rogue software onto money-transfer api pods that causes money-transfer pod into a CrashLoopBack. From XC logs, you can see that money-transfer served from foobz-ves-ocp-sg transition to foobz-mesh-aks1 seamlessly While refer-friend module still remain in foobz-ves-ocp-sg as refer-friend apps/api are healthy in foobz-ves-ocp-sg End-Result with F5 Distributed Cloud Mesh Demo Video Summary F5 is delivering on its mission to make it significantly easier to secure, deliver, and optimize any app, any API, anywhere. We strive to bring a better digital world to life. Our teams empower organization across the globe to create, secure, and run applications that enhance how we experience our evolving digital world.260Views2likes2CommentsKubernetes architecture options with F5 Distributed Cloud Services
Summary F5 Distributed Cloud Services (F5 XC) can both integrate with your existing Kubernetes (K8s) clustersand/or host aK8s workload itself. Within these distinctions, we have multiple architecture options. This article explores four major architectures in ascending order of sophistication and advantages. Architecture #1: External Load Balancer (Secure K8s Gateway) Architecture #2: CE as a pod (K8s site) Architecture #3: Managed Namespace (vK8s) Architecture #4: Managed K8s (mK8s) Kubernetes Architecture Options As K8s continues to grow, options for how we run K8s and integrate with existing K8s platforms continue to grow. F5 XC can both integrate with your existing K8s clustersand/orrun a managed K8s platform itself.Multiple architectures exist within these offerings too, so I was thoroughly confused when I first heard about these possibilities. A colleague recently laid it out for me in a conversation: "Michael, listen up: XC can eitherintegrate with your K8s platform,run insideyour K8s platform, host virtual K8s(Namespace-aaS), or run a K8s platformin your environment." I replied, "That's great. Now I have a mental model for differentiating between architecture options." This article will overview these architectures and provide 101-level context: when, how, and why would you implement these options? Side note 1: F5 XC concepts and terms F5 XC is a global platform that can provide networking and app delivery services, as well as compute (K8s workloads). We call each of our global PoP's a Regional Edge (RE). RE's are highly meshed to form the backbone of the global platform. They connect your sites, they can expose your services to the Internet, and they can run workloads. This platform is extensible into your data center by running one or more XC Nodes in your network, also called a Customer Edge (CE). A CE is a compute node in your network that registers to our global control plane and is then managed by a customer as SaaS. The registration of one or more CE's creates a customer site in F5 XC. A CE can run on ahypervisor (VMWare/KVM/Etc), a Hyperscaler (AWS, Azure, GCP, etc), baremetal, or even as a k8s pod, and can be deployed in HA clusters. XC Mesh functionality provides connectivity between sites, security services, and observability. Optionally, in addition, XC App Stack functionality allows a large and arbitrary number of managed clusters to be logically grouped into a virtual site with a single K8s mgmt interface. So where Mesh services provide the networking, App Stack services provide the Kubernetes compute mgmt. Our first 2 architectures require Mesh services only, and our last two require App Stack. Side note 2: Service-to-service communication I'm often asked how to allow services between clusters to communicate with each other. This is possible and easy with XC. Each site can publish services to every other site, including K8s sites. This means that any K8s service can be reachable from other sites you choose. And this can be true in any of the architectures below, although more granular controls are possible with the more sophisticated architectures. I'll explore this common question more in a separate article. Architecture 1: External Load Balancer (Secure K8s Gateway) In a Secure Kubernetes Gatewayarchitecture, you have integration with your existing K8s platform, using the XC node as the external load balancer for your K8s cluster. In this scenario, you create a ServiceAccount and kubeconfig file to configure XC. The XC node then performs service discovery against your K8s API server. I've covered this process in a previous article, but the advantage is that you can integrate withexisting K8s platforms. This allows exposing both NodePort and ClusterIP services via the XC node. XC is not hosting any workloads in this architecture, but it is exposing your services to your local network, or remote sites, or the Internet. In the diagram above, I show a web application being accesssed from a remote site (and/or the Internet) where the origin pool is a NodePort service discovered in a K8s cluster. Architecture 2: Run a site within a K8s cluster (K8s site type) Creating a K8s site is easy - just deploy a single manifest found here. This file deploys multiple resources in your cluster, and together these resources work to provide the services of a CE, and create a customer site. I've heard this referred to as "running a CE inside of K8s" or "running your CE as a pod". However, when I say "CE node" I'm usually referring to a discreet compute node like a VM or piece of hardware; this architecture is actually a group of pods and related resources that run within K8s to create a XC customer site. With XC running inside your existing cluster, you can expose services within the cluster by DNS name because the site will resolve these from within the cluster. Your service can then be exposed anywhere by the F5 XC platform. This is similar to Architecture 1 above, but with this model, your site is simply a group of pods within K8s. An advantage here is the ability to expose services of other types (e.g. ClusterIP). A site deployed into a K8s cluster will only support Mesh functionality and does not support AppStack functionality (i.e., you cannot run a cluster within your cluster). In this architecture, XC acts as a K8s ingress controller with built-in application security. It also enables Mesh features, such as publishing of other sites' services on this site, and publishing of this site's discovered services on other sites. Architecture 3: vK8s (Namespace-as-a-Service) If the services you use includeAppStack capabilities, then architectures #3 and #4 are possible for you.In these scenarios, our XC nodeactually runs your K8son your workloads. We are no longer integrating XC with your existing K8s platform. XCisthe platform. A simple way to run K8s workloads is to use avirtual k8s (vK8s) architecture. This could be referred to as a "managed Namespace" because by creating a vK8s object in XC you get a single namespace in a virtual cluster. Your Namespace can be fully hosted (deployed to RE's) or run on your VM's (CE's), or both. Your kubeconfig file will allow access to your Namespace via the hosted API server. Via your regular kubectl CLI (or via the web console) you can create/delete/manage K8s resources (Deployments, Services, Secrets, ServiceAccounts, etc) and view application resource metrics. This is great if you have workloads that you want to deploy to remote regions where you do not have infrastructure and would prefer to run in F5's RE's, or if you have disparate clusters across multiple sites and you'd like to manage multiple K8s clusters via a single centralized, virtual cluster. Best practice guard rails for vK8s With a vK8s architecture, you don't have your own cluster, but rather a managed Namespace. So there are somerestrictions(for example, you cannot run a container as root, bind to a privileged port, or to the Host network). You cannot create CRD's, ClusterRoles, PodSecurityPolicies, or Namespaces, so K8s operators are not supported. In short, you don't have a managed cluster, but a managed Namespace on a virtual cluster. Architecture 4: mK8s (Managed K8s) Inmanaged k8s (mk8s, also known as physical K8s or pk8s) deployment, we have an enterprise-level K8s distribution that is run at your site. This means you can use XC to deploy/manage/upgrade K8s infrastructure, but you manage the Kubernetes resources. The benefitsinclude what is typical for 3rd-party K8s mgmt solutions, but also some key differentiators: multi-cloud, with automation for Azure, AWS, and GCP environments consumed by you as SaaS enterprise-level traffic control natively allows a large and arbitrary number of managed clusters to be logically managed with a single K8s mgmt interface You can enable kubectl access against your local cluster and disable the hosted API server, so your kubeconfig file can point to a global URL or a local endpoint on-prem. Another benefit of mK8s is that you are running a full K8s cluster at your site, not just a Namespace in a virtual cluster. The restrictions that apply to vK8s (see above) do not apply to mK8s, so you could run privileged pods if required, use Operators that make use of ClusterRoles and CRDs, and perform other tasks that require cluster-wide access. Traffic management controls with mK8s Because your workloads run in a cluster managed by XC, we can apply more sophisticated and native policies to K8s traffic than non-managed clusters in earlier architectures: Service isolation can be enforced within the cluster, so that pods in a given namespace cannot communicate with services outside of that namespace, by default. More service-to-service controls exist so that you can decide which services can reach with other services with more granularity. Egress controlcan be natively enforced for outbound traffic from the cluster, by namespace, labels, IP ranges, or other methods. E.g.: Svc A can reach myapi.example.com but no other Internet service. WAF policies, bot defense, L3/4 policies,etc—allof these policies that you have typically applied with network firewalls, WAF's, etc—can be applied natively within the platform. This architecture took me a long time to understand, and longer to fully appreciate. But once you have run your workloads natively on a managed K8s platform that is connected to a global backbone and capable of performing network and application delivery within the platform, the security and traffic mgmt benefits become very compelling. Conclusion: As K8s continues to expand, management solutions of your clusters make it possible to secure your K8s services, whether they are managed by XC or exist in disparate clusters. With F5 XC as a global platform consumed as a service—not a discreet installation managed by you—the available architectures here are unique and therefore can accommodate the diverse (and changing!) ways we see K8s run today. Related Articles Securely connecting Kubernetes Microservices with F5 Distributed Cloud Multi-cluster Multi-cloud Networking for K8s with F5 Distributed Cloud - Architecture Pattern Multiple Kubernetes Clusters and Path-Based Routing with F5 Distributed Cloud8.7KViews29likes5Comments