How to Prepare Your Network Infrastructure to Add HPC Clusters for AI to Your Data Center

High Performance Computing (HPC) AI cluster infrastructures are increasingly finding their way into enterprise data centers. There are key considerations to address up front to avoid rearchitecting data center operations, monitoring, and security to accommodate these complex clusters.

Introducing HPC clusters into data centers raises both the potential and the risk of shifting the entire infrastructure ecosystem. When integrating AI data, tools, and policies into existing infrastructure, careful consideration must be given to maintaining operational standards for monitoring and reliability. The Kubernetes network infrastructure may also require additional scrutiny. This helps avoid outages caused by scalability limitations or security vulnerabilities.

Many decision-makers are rushing to acquire GPU-powered compute clusters to deliver AI model training and inferencing capabilities, so they don’t get left behind during this hype cycle. However, this accelerated GPU hardware and the requisite new integrations can inadvertently trigger unplanned, cascading re-architecture and require incremental auditing of infrastructure operations and personnel. This can ultimately add unplanned, long-term costs, impacting both CapEx and OpEx.

Network Segmentation risk when placing HPC AI clusters into multi-tenant data centers

The costs to operationalize HPC AI clusters require evaluation beyond power and cooling requirements. Adding the complex, high-performance infrastructure elements of HPC clusters to established data centers requires normalizing their services to the methods and practices that support the enterprise as a whole.

One primary area that needs to be normalized is how network segmentation is implemented within the larger data center. Network segmentation forms the basis for data center-scale monitoring, service levels, and security. Global hyperscalers have invested significant resources in their infrastructures, in products, services, and staffing, to deliver the baseline assurance of service tenancy expected of all modern systems.

But rushing to place Kubernetes-orchestrated HPC AI clusters into multi-tenant data centers creates problems without solving how to provide network-level tenancy compatible with existing systems. If compatibility is not maintained, a complete infrastructure redesign and audit would be required to ensure AI service performance and security. This opens the door to incremental delays, added cost, new vulnerabilities, and potential service disruptions. Know what to look for and plan ahead.

Let's break this down into three topics in the data center:

  1. Required Kubernetes ingress
  2. Data center challenges introduced by Kubernetes multi-tenancy
  3. Optimizing GPU performance with DPU/IPU offloading

We will end with a common case study in high-performance Kubernetes ingress load balancing to demonstrate immediate value for AI training and inferencing.

(1) Kubernetes network ingress adds new challenges to the data center

One area of standardization already embedded within HPC AI cluster operations is the use of Kubernetes to orchestrate infrastructure services. Kubernetes networking was purposely designed to allow containerized processes to communicate directly within the cluster. To accomplish this, a flat network structure is maintained across each cluster. This means that whenever a process wants to provide a service to applications outside its cluster (or present a service to other segregated applications within its own cluster), an additional Kubernetes service resource is required to facilitate that ingress communication.

Figure 1. Diagram showing Kubernetes ingress into a data center

Note: Kubernetes network design requires an infrastructure component to provide network traffic ingress for clustered applications—ingress is not a default function for network traffic. This management and integration of incoming service traffic to the Kubernetes cluster is determined by the infrastructure provider. So, depending on your configuration and service partners, some components within the data center network must be added to perform this ingress role for each cluster.

Kubernetes Ingress Requires Orchestration and New Security Policy Control

So, it follows that Kubernetes ingress requires an element within EACH cluster’s control plane to allocate and configure (or orchestrate) the necessary infrastructure resources. (Context: some organizations currently manage up to 50 production clusters, per the CNCF.org 2023 annual user survey dataset.) This provides an ingress path to the in-cluster containerized endpoints. To serve this need, there is an entire market of third-party products that provide ingress, called Kubernetes Ingress Controllers. Adding even more complexity, each Kubernetes Ingress Controller vendor brings its own set of service delivery, scaling, and data center networking integrations, which require support from the infrastructure NetOps (network operations) teams. Because ingress represents a strategic policy control point for all the services orchestrated within the Kubernetes clusters, security best practice calls for the SecOps (security operations) team to secure the ingress orchestration by inserting firewall and monitoring capabilities that leverage the scope of the rest of the data center infrastructure.

Kubernetes Ingress Controller solutions are operationalized through the DevOps (developer operations) teams’ declaration of Kubernetes service requirements as part of their CI/CD pipelines. Kubernetes service declarations for ingress are defined by the following standard Kubernetes resources:

 Kubernetes API   Data Center Networking Service
 LoadBalancer     Port-based service delivery (L4 TCP) using network address translation to provide access to service endpoint ports within the cluster
 Ingress          A proxy-based service providing HTTP service delivery, including HTTP routing, TLS termination, and virtual hosting capabilities for service endpoints within the cluster
 Gateway          A proxy-based, newly released standard that provides a full range of extensible application delivery services, including role-based configuration for advanced routing of TCP, UDP, HTTP, and HTTP/2 (gRPC) service traffic to service endpoints within the cluster
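
For readers who have not worked with these resources, here is a minimal sketch, using the Python Kubernetes client, of how a DevOps pipeline might declare the first two services in the table above. The namespace, ingress class, hostnames, and service names are illustrative placeholders; the installed Ingress Controller (BIG-IP or any other) is what actually maps these declarations onto the data center network.

```python
# Minimal sketch: declare a type=LoadBalancer Service (L4) and an HTTP Ingress (L7)
# with the Python kubernetes client. Names and hostnames are placeholders.
from kubernetes import client, config

config.load_kube_config()

# L4: expose the in-cluster service on an externally routable address and port.
svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="inference-lb", namespace="ai-tenant-a"),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "inference"},
        ports=[client.V1ServicePort(port=443, target_port=8443)],
    ),
)
client.CoreV1Api().create_namespaced_service("ai-tenant-a", svc)

# L7: HTTP routing and virtual hosting, implemented by the Ingress Controller.
ing = client.V1Ingress(
    metadata=client.V1ObjectMeta(name="inference-http", namespace="ai-tenant-a"),
    spec=client.V1IngressSpec(
        ingress_class_name="example-ingress-class",  # placeholder class name
        rules=[client.V1IngressRule(
            host="inference.example.internal",
            http=client.V1HTTPIngressRuleValue(paths=[client.V1HTTPIngressPath(
                path="/v1", path_type="Prefix",
                backend=client.V1IngressBackend(
                    service=client.V1IngressServiceBackend(
                        name="inference",
                        port=client.V1ServiceBackendPort(number=8443))),
            )]),
        )],
    ),
)
client.NetworkingV1Api().create_namespaced_ingress("ai-tenant-a", ing)
```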


F5’s expertise is in network load balancing through our F5 BIG-IP product suite. BIG-IP has evolved its networking infrastructure, software, and hardware (appliances, chassis, virtual machines) to provide intelligent application delivery and security services. These proven network functions are now available in cloud-native containerized form factors and are currently in production managing traffic at the scale of hundreds of Gbps (gigabits per second) through direct integration with data center networking fabrics. Adding to the range of ingress options, F5 offers two Kubernetes Ingress Controllers: BIG-IP Container Ingress for hardware and VM deployments, and BIG-IP Next Service Proxy for Kubernetes for cloud-native deployments. Both provide secured ingress for Kubernetes clustered applications.

Note: because Kubernetes networking ingress is a requirement for any Kubernetes cluster, server OEMs who provide pre-bundled bills of materials (BOMs) for Kubernetes clusters often include F5 BIG-IP appliances and chassis-based products to fill the ingress implementer role. BIG-IP is a well-understood and accepted component in enterprise data centers for both NetOps and SecOps teams, and it forms the basis for normalizing Kubernetes cluster ingress, networking, and security to the larger data center.

(2) Single-tenancy and multi-tenancy in the data center: F5’s Kubernetes “Ball of Fire”

Another set of hidden challenges becomes apparent when cluster operational complexity grows. Due to the way Kubernetes tenancy has typically been deployed in most data center environments, these complex challenges have thus far been masked.

In Kubernetes, the nodes that run containerized applications provide inter-host routing for the cluster. Each node maintains a NodeIP, the IP address of the server acting as a node, which is a network address routable within the data center’s underlying network fabric. (NodeIPs are distinct from each cluster’s internal ClusterIPs, the addresses that facilitate flat, direct routing for services within a cluster.) In inter-host routing, when traffic needs to egress from a given node to another node, or to reach resources outside the cluster, the traffic is sourced from the infrastructure-routable NodeIP of whichever cluster node is hosting that particular application container instance. At first glance this seems like a good, distributed network design, but it removes a key control point required in most data center operations.
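
To see the two address spaces side by side, here is a small sketch using the Python Kubernetes client (assuming a reachable kubeconfig); it lists the fabric-routable NodeIPs and the cluster-internal pod addresses that sit behind ClusterIP services.

```python
# Sketch: contrast fabric-routable NodeIPs with cluster-internal pod addresses.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Node addresses routable on the data center fabric (NodeIPs).
for node in v1.list_node().items:
    addrs = [a.address for a in node.status.addresses if a.type == "InternalIP"]
    print(f"node {node.metadata.name}: NodeIP(s) {addrs}")

# Pod addresses on the flat, cluster-internal network (behind ClusterIP services).
for pod in v1.list_pod_for_all_namespaces().items:
    print(f"{pod.metadata.namespace}/{pod.metadata.name}: podIP {pod.status.pod_ip} "
          f"on node {pod.spec.node_name}")
```

Traffic leaving any of these pods for the data center is sourced from the hosting node’s NodeIP, not from anything that identifies the pod or its namespace.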

Figure 2. Basic Kubernetes egress from NodeIPs

Imagine designing network monitoring and security in the larger data center for Kubernetes-hosted applications. If a whole Kubernetes cluster can be allocated, on bare metal or as virtual machines, to each segregated security tenant of the data center, there is no problem: all the NodeIP addresses for that dedicated cluster belong to one specific cluster owner, and thus one data center tenant.

Figure 3. Kubernetes egress when each tenant gets its own Kubernetes cluster

In this simple model where each cluster is assigned to one tenant, NetOps teams understand how to allocate those addresses to that cluster owner. Firewall security and monitoring can identify the data center tenant simply from the network segmentation required to route its traffic within the infrastructure. SecOps teams can build monitoring and security on this simple network allocation scheme. When each cluster represents one tenant, existing data center operations are maintained. In this configuration, using many clusters, while it proliferates Kubernetes everywhere, keeps the resources distributed, scalable, and secure.

However, what happens when the resources inside the Kubernetes cluster need to support multi-tenancy from a data center tenancy perspective? In this case, which is the security case for HPC AI clusters, multiple applications belonging to different data center tenants are hosted within the same cluster. This means egress traffic for multiple data center tenants can now be sourced from the same NodeIP whenever traffic leaves any given host node. This masks the network details needed by generations of high-performance monitoring and security tools in the data center infrastructure. It creates what F5 calls the “Ball of Fire”.

Figure 4. Multi-tenant Kubernetes clusters make identifying network traffic for monitoring, service levels, and security a mess

 

How Telecoms Successfully Deliver Multi-Tenant Kubernetes Clusters in Data Centers

The challenge of deploying multi-tenant Kubernetes clusters was solved, out of necessity, by network service providers (telecoms) around the world as they adopted the standards for 5G services and applications. Critical network functions shifted form factors from VMs to cloud-native network functions (CNFs) such as CGNAT, DDoS mitigation, firewall, and policy management, aggregated from numerous vendors and sources. Each CNF from each vendor therefore adds another unique segregation and security challenge in the broader network context outside the Kubernetes cluster.

While F5 BIG-IP has years of operational experience integrating at scale within these providers’ network fabrics, for both their IT and telco clouds, there was no Kubernetes standard to handle egress tenancy.

An F5 customer, an early-adopter Tier 1 service provider, requested changes to BIG-IP in order to build out a wide-scale 5G infrastructure on an aggressive schedule, with new requirements:

  • BIG-IP distributed in a containerized form factor that itself could be managed and controlled by Kubernetes
  • Support for network and application protocols, application delivery features, and security functionality that the existing Kubernetes networking architecture does not address

In short, they needed a Kubernetes networking infrastructure service that would normalize their Kubernetes cluster deployments to their data center infrastructure while maintaining, for both ingress and egress, the ‘Swiss Army knife’ scale and functionality already provided by the F5 BIG-IP appliances and chassis in deployment, carried over to the modern form factor.

This containerization of F5 BIG-IP functions generated a new iteration of services labeled F5 BIG-IP Next. A Kubernetes resource-based control plane was needed, along with deeper integration with the internals of the Kubernetes clusters themselves. This complex set of requirements had not been met in the industry prior to this customer request, so F5 developed BIG-IP Next Service Proxy for Kubernetes (SPK) specifically to fill this functional gap in Kubernetes.

Figure 5. BIG-IP Next SPK was designed to fit the requirements of 5G CNF rollouts

SPK uniquely provides a distributed implementation of BIG-IP, controlled as a Kubernetes resource, which understands both Kubernetes namespace-based tenancy and the network segregation tenancy required by the data center networking fabric. BIG-IP Next SPK lives both inside the Kubernetes clusters and inside the data center network fabric. SPK provides control from the MAC level (L2 networking) all the way to the application level (L7 networking) for all traffic ingressing or egressing Kubernetes clusters. SPK functions not just as the required Kubernetes Ingress Controller but also, through declared custom resource definitions (CRDs), as a policy and security engine that normalizes multi-tenant clustered applications to the wider data center and global network at telecom speeds and scale. SPK was the key component that allowed a global telecom to achieve multi-tenant scale and manage complexity in a Kubernetes framework to deliver 5G.

Multiple teams can take advantage of this advanced functionality:

  • DevOps teams can continue to use standard Kubernetes resource declarations to deploy applications from their tested CI/CD pipelines
  • NetOps teams in the data center can dictate that all traffic from a namespace within the cluster must egress from specific VLANs, VxLANs, interface VRFs, or IPv4/IPv6 subnets, and they continue to define the required service levels based on this network segregation (see the conceptual sketch after this list)
  • SecOps teams use the inherent security and monitoring found in BIG-IP in conjunction with their other security controls to ensure secure application delivery
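
To make the NetOps point concrete, here is a conceptual sketch in Python. It is not SPK configuration or an F5 API; it simply shows the kind of namespace-to-network-segment mapping a NetOps team would define, read back from ordinary namespace labels, so that an egress layer could source each tenant’s traffic from its own VLAN and subnet. The label key, VLAN IDs, and subnets are illustrative assumptions.

```python
# Conceptual sketch only: relate Kubernetes namespace tenancy to data center
# network tenancy via plain namespace labels. Not an actual SPK or F5 schema.
from kubernetes import client, config

# NetOps-owned plan: which network segment each data center tenant egresses from.
TENANT_SEGMENTS = {
    "tenant-a": {"vlan": 310, "egress_subnet": "10.31.0.0/24"},
    "tenant-b": {"vlan": 320, "egress_subnet": "10.32.0.0/24"},
}

config.load_kube_config()
v1 = client.CoreV1Api()

for ns in v1.list_namespace().items:
    tenant = (ns.metadata.labels or {}).get("example.com/dc-tenant")  # assumed label key
    segment = TENANT_SEGMENTS.get(tenant)
    if segment:
        print(f"namespace {ns.metadata.name}: tenant {tenant} "
              f"-> VLAN {segment['vlan']}, egress subnet {segment['egress_subnet']}")
    else:
        print(f"namespace {ns.metadata.name}: no data center tenant mapping (flag for review)")
```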

This preserves data center operations without fundamentally breaking the Kubernetes networking model or forcing containers to live on the underlying data center networks. It eliminates the need for new security implementations and the re-auditing process that adopting them would require. This value cannot be overstated when it comes to deploying complex Kubernetes clusters into established data centers quickly.

BIG-IP Next SPK is now in production, carrying traffic for tens of millions of mobile subscribers every day across global networks. At the scale and speed of such massive network deployments, managing a piece as critical as Kubernetes ingress or egress cannot be a side project for the infrastructure team or a feature add-on from a firewall vendor. Reliability, scaling, and load balancing must be in the core DNA of the network stack. And with these newly scaled capabilities for Kubernetes in distributed computing environments, SPK is ready to deliver for AI workloads.

Figure 6. Kubernetes egress with proper data center network segregation

 

(3) Optimize GPUs with DPU/IPU Offloads for HPC AI Kubernetes network ingress and egress services

Our third infrastructure pain point is maximizing multi-GPU compute performance and scale as HPC AI clusters are introduced into IT data centers originally designed for typical web service and client-server compute workloads. By design, to accommodate super-HPC scale, these new HPC AI clusters have inter-service (east-west) networking requirements that can reach bandwidths equivalent to those needed to deliver mobile traffic for entire geographic continents. The networking bandwidths within HPC AI clusters are staggering.

These networking requirements were introduced to facilitate the use of the following protocols as a data bus between nodes (east-west):

  • Remote Direct Memory Access (RDMA)
  • Nonvolatile memory express (NVMe) over Fabrics

These protocols utilize very high-bandwidth, non-blocking network architectures that allow one computer to directly access data on another across the network without expensive OS stacks or CPU cycles spent slowly keeping track of the transfers. This significantly lowers latency, ensures the fastest response times to data for AI workloads, and allows clusters of GPUs to copy data between themselves using extensions of their own chip-to-chip data technologies. The network fabric functions as the new backplane for the whole HPC AI cluster.

This HPC supercomputing cluster is operationally opaque, a ‘Lego block’ within the larger data center. Not surprisingly, the technical requirements in HPC AI cluster design are very tight and non-negotiable once tied to specific hardware decisions. Extending RAM, storage, and proprietary chip-to-chip technology across the network is not a simple task and must be highly engineered. This is not news to the HPC community, but it is new for most enterprise and network service operator teams. While the protocols used certainly aren’t new, the scope at which specific hardware implements them in HPC AI clusters is growing at an alarming rate. If HPC AI cluster proliferation is the new normal, then the opaque nature of their networking will drive significant cost and operational challenges in the near future.

Programmable SuperNIC data processing units/infrastructure processing units (DPUs/IPUs) are replacing the HPC AI cluster node NICs to facilitate connectivity within these highly engineered network fabrics. These new DPUs don’t just include the network switching technology necessary to connect to the 200Gbps/400Gbps ports on the non-blocking network switches; they also include hardware accelerators for NVMe, connection handling, compression, encryption, and other offloads. But like their NIC predecessors, DPU/IPUs remain compatible with x86 and Arm hosts, which opens a new range of flexible functionality.

Kubernetes host networking stacks are quickly being optimized to take advantage of DPU/IPU accelerators. The de facto standard Open vSwitch (OVS) Linux networking stack has connection-offload implementations for multiple DPU/IPU vendors’ accelerators, allowing high-speed network flows between Kubernetes ClusterIPs for east-west traffic.
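
As a quick illustration, the standard OVS switch for hardware offload is a single database setting; the sketch below (Python, assuming a host with Open vSwitch and ovs-vsctl installed) checks and enables it. Whether flows are actually pushed down to a given DPU/IPU still depends on the vendor driver, which this sketch does not verify.

```python
# Sketch: query and enable the standard OVS hardware-offload setting that
# DPU/IPU-aware deployments rely on. Requires ovs-vsctl on the host.
import subprocess

def ovs_vsctl(*args: str) -> str:
    return subprocess.run(["ovs-vsctl", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

# ovs-vsctl errors if the key has never been set, so treat that as "unset".
try:
    current = ovs_vsctl("get", "Open_vSwitch", ".", "other_config:hw-offload")
except subprocess.CalledProcessError:
    current = "unset"
print("hw-offload:", current)

# Enable hardware offload; OVS must be restarted for the change to take effect.
ovs_vsctl("set", "Open_vSwitch", ".", "other_config:hw-offload=true")
```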

Implementing ingress and egress services for clusters can consume 20–30% of HPC AI node compute

We’re observing that ingress and node-level service-to-service networking takes a significant amount of cluster compute resources when performed by software-based networking stacks running on each cluster node, or across a set of them. To optimize performance, networking software pins itself to specific processing cores and pre-allocates memory to process network flows. Those resources appear fully consumed and are unavailable to the HPC AI cluster host. It is no overstatement that 20–30% of cluster host compute can be expended by network software simply getting traffic into and out of the cluster nodes.

The compute footprint in the HPC AI data center should instead be focused on AI application services, which require the parallel stream and tensor-core processing that drove the deployment of expensive GPUs in the first place. Every host CPU cycle expended on infrastructure services, like ingress/egress networking, starves the AI workloads that keep those expensive GPU resources busy. That’s when the TCO calculations that justify the GPU hardware expenditure and the cluster’s expensive non-blocking networking components tip even further into the red in terms of efficiency, cost, and ROI. (Somewhere a CFO just pulled some hair from their head.)
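
A rough back-of-the-envelope calculation makes the point; the node count and core count below are assumptions for illustration, not measurements.

```python
# Illustrative sizing exercise with assumed numbers (not measured data):
# estimate host compute consumed by software networking across a fleet.
nodes = 256                  # assumed cluster size
cores_per_node = 64          # assumed host CPU cores per GPU node
net_core_fraction = 0.25     # midpoint of the 20-30% range cited above

cores_lost_per_node = cores_per_node * net_core_fraction
fleet_cores_lost = nodes * cores_lost_per_node
print(f"Per node: {cores_lost_per_node:.0f} of {cores_per_node} cores spent on networking")
print(f"Fleet-wide: {fleet_cores_lost:.0f} host cores unavailable to feed the GPUs")
# Offloading ingress/egress processing to a DPU/IPU returns those cores to the
# data pipelines (tokenization, sharding, checkpointing) that keep GPUs busy.
```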

Kubernetes ingress and egress services, and their security, are prime targets for DPU/IPU network accelerator offloads. The DPU/IPUs are being placed inside the HPC AI cluster for their own reasons, namely RDMA and NVMe offloads. However, the same hardware can also be used for ingress/egress network processing, freeing host compute so the GPUs can be utilized efficiently.

 

Customer Use Case: High-Performance and Scalable Kubernetes Load Balancing for S3 HPC AI Cluster Storage Access

Even before the proliferation of HPC AI clusters hit the data center world, there was already a fundamental AI use case that demonstrates the value of accelerated and intelligent BIG-IP application-level delivery. AI model training, or retraining, requires data, and lots of it. Moving data into HPC AI cluster storage is largely handled through object storage APIs. Data is replicated from various tiers of object storage sources and copied into clustered file technology, which can provide high-speed access to data for GPU stream processing.

The most widely deployed object storage API is S3 (Simple Storage Service), a cloud object storage API pioneered by AWS. S3 uses HTTP REST API methods in which HTTP objects represent buckets (folders) and files. The S3 services translate HTTP requests into storage requests, which maintain efficient reading and writing of data across devices as well as security permissions. There are numerous S3-compatible API implementations available, either as containerized HTTP microservices that front attached storage devices, or as hosted HTTP endpoints in storage vendors’ appliance arrays.
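
From the training pipeline’s point of view, the load-balanced S3 service is just an HTTP endpoint. A minimal sketch using boto3, with a placeholder endpoint hostname, bucket, and credentials, is shown below.

```python
# Minimal sketch: pull a training shard from an S3-compatible endpoint that is
# published behind a load balancer. Hostname, bucket, and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.storage.example.internal",  # load-balanced S3 endpoint
    aws_access_key_id="ACCESS_KEY",                       # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

# Copy a training shard from object storage into local cluster staging.
s3.download_file("training-data", "shards/shard-0001.tar", "/data/shard-0001.tar")
```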

Figure 7. S3 load balancing is a fundamental AI use case

F5 BIG-IP hardware-accelerated appliances and chassis already load balance many S3 deployments, allowing for intelligent routing of storage object requests. This is typically done by publishing multiple service endpoints, each represented by a separate hostname. Resiliency and scale are handled by L4 accelerated connection load balancing. This is the simplest and highest-scale solution, but not the only one available in BIG-IP. Alternatively, S3 HTTP requests can be processed with BIG-IP evaluating the HTTP Host header, path, and query parameters, and all of this intelligence can be used to load-balance S3 traffic to specific endpoints. TLS offload is also an obvious choice because of hardware acceleration.

There is another point of value for the AI S3 use case. The S3 client libraries are built to support high concurrency through threading. The load-balancing solution must therefore also be able to handle very high levels of connection concurrency efficiently. This is all part of understanding the task of load balancing S3, and BIG-IP supports the highest scale in the industry for both L4 connection and L7 HTTP request load balancing of S3 traffic. S3 load balancing is a task BIG-IP was purpose-built to perform.
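
To illustrate the concurrency point, the sketch below uses boto3’s transfer configuration, again with placeholder endpoint and bucket names, to fan one large object transfer across many threads; each thread is another connection the ingress load balancer must track.

```python
# Sketch: S3 client libraries fan a single large transfer out over many concurrent
# HTTP connections, which is exactly the connection load the load balancer absorbs.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3", endpoint_url="https://s3.storage.example.internal")  # placeholder

# 64 MB parts with 32 worker threads means up to 32 simultaneous connections
# for this one transfer alone; multiply by every data-loading worker in the cluster.
cfg = TransferConfig(multipart_chunksize=64 * 1024 * 1024, max_concurrency=32)
s3.upload_file("/data/checkpoint.bin", "model-artifacts",
               "checkpoints/ckpt-42.bin", Config=cfg)
```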

 

Distributed Application Delivery for HPC AI Clusters is Available Today

The ability to hardware-accelerate HPC AI cluster ingress and egress network services on deployed DPU/IPUs is available in BIG-IP Next SPK today. The DPU/IPU-accelerated solution is not a new, limited version of F5’s data plane, but rather the full BIG-IP stack. That means access to a wide range of functions for simplified AI service deployments with BIG-IP, for both reverse proxy ingress and forward proxy egress, is available as a key functional component of your HPC AI cluster deployment. These application delivery and security functions are automatically inline and efficient because they are part of the same network stack that provides the required ingress functionality.

An additional benefit of locating the offload capability for ingress and egress networking so close to the HPC AI cluster network hardware is that external services, running on more traditional and less costly compute, can also be injected into the AI application path without complex service-chaining orchestration. This provides an obvious point of network integration for the data observability features needed for privacy and compliance, for AI API gateway features, and for new security controls. Because BIG-IP Next SPK can map HPC AI cluster namespace tenancy to data center network tenancy, these external products from F5 and others can be placed inline without requiring deep integration into the HPC AI clusters themselves. Policies can be based on the network segmentation F5 provides for the cluster, not restricted to the specifics of a given GPU generation of HPC AI cluster.

Using hardware offload capabilities for networking, application delivery, and security can be complicated and requires significant testing to ensure scale and support. F5 remains committed to a vision of more open infrastructure for offload services through its work in the Open Programmable Infrastructure (OPI) initiative, which F5 helped found as part of the Linux Foundation in 2022. OPI’s goal remains the open-source democratization of APIs and programmable SmartNICs to promote wider adoption of hardware acceleration across the broader software community.

The reality, however, is that differentiated hardware offloads with proprietary APIs will continue to forge the cutting edge of the performance computing market. No one understands this better than the HPC community. Integrating a dedicated ingress and egress architecture early, one proven at scale and backed by a vendor constantly engaged in this market, sets a direction that can steer your HPC AI cluster deployments around both data center and financial obstacles and avoid slowing down your AI application rollouts and adoption.

To talk to an F5 representative, Contact Us and note in the text box that you’d like to discuss AI HPC clusters.

Published Sep 25, 2024
Version 1.0